
1 Solaris/Linux Performance Measurement and Tuning
Adrian Cockcroft

2 Abstract This course focuses on the measurement sources and tuning parameters available in Unix and Linux, including TCP/IP measurement and tuning, complex storage subsystems, and a deep dive into advanced Solaris metrics such as microstates and extended system accounting. The meaning and behavior of the metrics are covered in detail. Common fallacies, misleading indicators, sources of measurement error and other traps for the unwary will be exposed. Free tools for capacity planning are covered in detail by this presenter in a separate Usenix workshop.

3 Sources Adrian Cockcroft
Sun Microsystems, Distinguished Engineer
eBay Research Labs, Distinguished Engineer
Netflix 2007, Director - Web Engineering
Note: I am a Netflix employee, but this material does not refer to and is not endorsed by Netflix. It is based on the author's work over the last 20 years.
CMG papers and Sunday workshops by the author:
Unix CPU Time Measurement Errors (Best Paper 1998)
TCP/IP Tutorial - Sunday Workshop
Capacity Planning - Sunday Workshop
Grid Tutorial - Sunday Workshop
Capacity Planning with Free Tools - Sunday Workshop
Books by the author:
Sun Performance and Tuning, Prentice Hall, 1994, 1998 (2nd Ed)
Resource Management, Prentice Hall, 2000
Capacity Planning for Internet Services, Prentice Hall, 2001

4 Contents Capacity Planning Definitions Metric collection interfaces
Process - microstate and extended accounting CPU - measurement issues Network - Internet Servers and TCP/IP Disks - iostat, simple disks and RAID Memory Quick tips and Recipes References Solaris/Linux Performance Measurement and Tuning 4/20/2017

5 Definitions

6 Capacity Planning Definitions
Capacity - resource utilization and headroom
Planning - predicting future needs by analyzing historical data and modeling future scenarios
Performance Monitoring - collecting and reporting on performance data
Unix/Linux (apologies to users of OSX, HP-UX, AIX etc.) - emphasis is on Solaris since it is a comprehensively instrumented and full featured Unix; Linux is mostly a subset

7 Measurement Terms and Definitions
Bandwidth - gross work per unit time [unattainable]
Throughput - net work per unit time
Peak throughput - throughput at the maximum acceptable response time
Response time - time to complete a unit of work, including waiting
Service time - time to process a unit of work after waiting
Queue length - number of requests waiting
Utilization - busy time relative to elapsed time [can be misleading]
Rule of thumb: estimate the 95th percentile response time as three times the mean response time.
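To make these terms concrete, here is a minimal sketch in Python (made-up numbers, simple single-queue view) showing how they are derived from one measurement interval:

# Sketch: relate the terms above using made-up numbers for one measurement interval.
interval = 60.0          # seconds of elapsed time
completions = 1200       # units of work completed in the interval
busy_time = 42.0         # seconds the server was busy
mean_response = 0.090    # measured mean response time in seconds, includes waiting

throughput = completions / interval        # net work per unit time: 20/s
utilization = busy_time / interval         # 0.70 busy [can be misleading]
service_time = busy_time / completions     # 35ms per unit of work, excludes waiting
p95_estimate = 3 * mean_response           # rule of thumb: ~270ms at the 95th percentile
print(throughput, utilization, service_time, p95_estimate)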

8 Capacity Planning Requirements
We care about CPU, Memory, Network and Disk resources, and Application response times We need to know how much of each resource we are using now, and will use in the future We need to know how much headroom we have to handle higher loads We want to understand how headroom varies, and how it relates to application response times and throughput We want to be able to find the bottleneck in an under-performing system Solaris/Linux Performance Measurement and Tuning 4/20/2017

9 Metrics

10 Measurement Data Interfaces
Several generic raw access methods Read the kernel directly Structured system data Process data Network data Accounting data Application data Command based data interfaces Scrape data from vmstat, iostat, netstat, sar, ps Higher overhead, lower resolution, missing metrics Data available is platform and release specific either way Solaris/Linux Performance Measurement and Tuning 4/20/2017

11 Reading kernel memory - kvm
The only way to get data in very old Unix variants Use kernel namelist symbol table and open /dev/kmem Solaris wraps up interface in kvm library Advantages Still the only way to get at some kinds of data Low overhead, fast bulk data capture Disadvantages Too much intimate implementation detail exposed No locking protection to ensure consistent data Highly non-portable, unstable over releases and patches Tools break when kernel moves between 32 and 64bit address support Solaris/Linux Performance Measurement and Tuning 4/20/2017

12 Structured Kernel Statistics - kstat
Solaris 2 introduced kstat and extended usage in each release Used by Solaris 2 vmstat, iostat, sar, network interface stats, etc. Advantages The recommended and supported Solaris metric access API Does not require setuid root commands to access for reads Individual named metrics stable over releases Consistent data using locking, but low overhead Unchanged when kernel moves to 64bit address support Extensible to add metrics without breaking existing code Disadvantages Somewhat complex hierarchical kstat_chain structure State changes (device online/offline) cause kstat_chain rebuild Solaris/Linux Performance Measurement and Tuning 4/20/2017
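As an illustration of programmatic access without writing C against libkstat, here is a rough sketch that scrapes the kstat(1M) command instead; the -p (parseable) flag, its module:instance:name:statistic plus tab plus value output format, and the cpu_stat filter used here are assumptions to verify against your Solaris release:

# Sketch: snapshot kstat counters by scraping "kstat -p" output.
import subprocess

def kstat_snapshot(filter_arg="cpu_stat:::"):
    out = subprocess.run(["kstat", "-p", filter_arg],
                         capture_output=True, text=True, check=True).stdout
    stats = {}
    for line in out.splitlines():
        key, _, value = line.partition("\t")   # "module:instance:name:stat" -> value
        stats[key] = value
    return stats

# Print the per-CPU user/kernel/idle tick counters from the snapshot
for key, value in sorted(kstat_snapshot().items()):
    if key.endswith((":user", ":kernel", ":idle")):
        print(key, value)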

13 Kernel Trace - TNF, Dtrace, ktrace
Solaris, Linux, Windows and other Unixes have similar features Solaris has TNF probes and prex command to control them User level probe library for hires tracepoints allows instrumentation of multithreaded applications Kernel level probes allow disk I/O and scheduler tracing Advantages Low overhead, microsecond resolution I/O trace capability is extremely useful Disadvantages Too much data to process with simple tracing capabilities Trace buffer can overflow or cause locking issues Solaris 10 Dtrace is a quite different beast! Much more flexible Solaris/Linux Performance Measurement and Tuning 4/20/2017

14 Dtrace – Dynamic Tracing
One of the most exciting new features in Solaris 10, rave reviews. Book: "Solaris Performance and Tools" by Richard McDougall and Brendan Gregg. Advantages: no overhead when it is not in use; low overhead probes can be put anywhere/everywhere; trace data is correlated and filtered at source, so you get exactly the data you want, with very sophisticated data providers included; bundled, supported, designed to be safe for production systems. Disadvantages: Solaris specific, but being ported to BSD/Linux; no high level tools support yet; yet another scripting language to learn - somewhat similar to "awk".

15 Hardware counters Solaris cpustat for X86 and UltraSPARC pipeline and cache counters Solaris busstat for server backplanes and I/O buses, corestat for multi-core systems Intel Trace Collector, Vampir for Linux Most modern CPUs and systems have counters Advantages See what is really happening, more accurate than kernel stats Cache usage useful for tuning code algorithms Pipeline usage useful for HPC tuning for megaflops Backplane and memory bank usage useful for database servers Disadvantages Raw data is confusing, lots of architectural background info needed Most tools focus on developer code tuning Solaris/Linux Performance Measurement and Tuning 4/20/2017

16 Configuration information
Configuration data comes from too many sources! Solaris device tree displayed by prtconf and prtdiag Solaris 8 adds dynamic configuration notification device picld SunVTS component test system has vtsprobe to get config SCSI device info using iostat -E in Solaris Logical volume info from product specific vxprint and metastat Hardware RAID info from product specific tools Critical storage config info must be accessed over ethernet… It is very hard to combine all this data! DMTF CIM objects try to address this, but no-one seems to use them… Free tool - Config Engine: Solaris/Linux Performance Measurement and Tuning 4/20/2017

17 Application instrumentation Examples
Oracle V$ Tables – detailed metrics used by many tools ARM standard instrumentation Custom do-it-yourself and log file scraping Advantages Focussed application specific information Business metrics needed to do real capacity planning Disadvantages No common access methods ARM is a collection interface only, vendor specific tools, data Very few applications are instrumented, even fewer have support from performance tools vendors Solaris/Linux Performance Measurement and Tuning 4/20/2017

18 Kernel values, tunables and defaults
There is often far too much emphasis on kernel tweaks. There really are few "magic bullet" tunables, and they rarely make a significant difference - fix the system configuration or tune the application instead! There are very few adjustable components ("no user serviceable parts inside"), but Unix has so much history that people think it is like a 70's car. Solaris really is dynamic, adaptive and self-tuning; most other "traditional Unix" tunables are just advisory limits. Tweaks may be workarounds for bugs or problems: when a patch or OS release removes the problem, remove the tweak. See the Solaris Tunable Parameters Reference Manual (if you must...).

Tuning the kernel is a hard subject to deal with. Some tunables are well known and easy to explain. Others are more complex or change from one release to the next. The settings in use are often based on out-of-date folklore. In normal use there is no need to tune the Solaris 2 kernel; it dynamically adapts itself to the hardware configuration and the application workload. If it isn't working properly, you may have a configuration error, a hardware failure, or a software bug. To fix a problem, you check the configuration, make sure all the hardware is OK, and load the latest software patches.

Some tunable parameters configure the sizes or limits of data structures. The size of these data structures has no effect on performance, but if they are set too low, an application might not run at all. Configuring shared memory allocations for databases falls into this category. Kernel configuration and tuning variables are normally edited into the /etc/system file by hand. Unfortunately, any kernel data that has a symbol can be set via this file at boot time, whether or not it is a documented tunable.

So why is there so much emphasis on kernel tuning? And why are there such high expectations of the performance boost available from kernel tweaks? I think the reasons are historical, and I'll use a car analogy to explain it. Compare a 1970s car with a 1998 car. The older car has a carburetor, needs regular tune-ups, and is likely to be temperamental at best. The 1998 car has computerized fuel injection, self-adjusting engine components, and is easier to live with, consistent and reliable. If the old car won't start reliably, you get out the workshop manual and tinker with a large number of fine adjustments. The 1998 car's computerized ignition and fuel injection systems have no user serviceable components.

Unix started out in an environment where the end users had source code and did their own tuning and support. As Unix became a commercial platform for running applications, the end users changed. Tinkering with the operating system is a distraction. SunSoft engineers have put a lot of effort into automating the tuning for Solaris 2. It adaptively scales according to the hardware capabilities and the workload it is running. The self-configuring and tuning nature of Solaris contributes to its ease of use and greatly reduces the gains from tweaking it yourself. Each successive version of Solaris 2 has removed tuning variables by converting hand-adjusted values into adaptively managed limits.

19 SE Toolkit and XE Toolkit

20 SE toolkit Example Tools
Open Source performance toolkit for rapidly creating custom data sources. Makes all the very extensive Solaris metrics easily available. Solaris specific metrics fully supported, no other OS support (see XEtoolkit...). Written by Rich Pettit with contributions from Adrian Cockcroft. Get SE3.4.1 (support for SPARC & x86: Solaris 8, 9, 10, OpenSolaris).
Function - Example SE Programs
Rule Monitors - cpg.se monlog.se mon_cm.se live_test.se percollator.se zoom.se virtual_adrian.se virtual_adrian_lite.se
Disk Monitors - siostat.se xio.se xiostat.se iomonitor.se iost.se xit.se disks.se
CPU Monitors - cpu_meter.se vmmonitor.se mpvmstat.se
Process Monitors - msacct.se pea.se ps-ax.se ps-p.se pwatch.se pw.se
Network Monitors - net.se tcp_monitor.se netmonitor.se netstatx.se nfsmonitor.se nx.se
Clones - iostat.se uname.se vmstat.se nfsstat-m.se perfmeter.se xload.se
Data browsers - aw.se infotool.se multi_meter.se
Contributed Code - anasa dfstats kview systune watch orcollator.se
Test Programs - syslog.se cpus.se pure_test.se collisions.se uptime.se dumpkstats.se net_example nproc.se kvmname.se

21 SE language features Not a new language to learn from scratch!
SE is a 64bit interpreted dialect of C Not a new language to learn from scratch! Standard C /usr/ccs/bin/cpp used at runtime to preprocess SE scripts Main omissions - pointer types and goto Main additions - classes and “string” type powerful ways to handle dynamically allocated data built-in fast balanced tree routines for storing key indexed data Dynamic linking to all existing C libraries Built-in classes access kernel data Supplied class code hides details, provides the data you want Example scripts improve on basic utilities e.g. siostat.se, nx.se, pea.se Example rule based monitors e.g. virtual_adrian.se, orcallator.se Solaris/Linux Performance Measurement and Tuning 4/20/2017

22 "virtual adrian" rules summary
Disk Rule for all disks at once Looks for slow disks and unbalanced usage Network Rule for all networks at once Looks for slow nets and unbalanced usage Swap Rule - Looks for lack of available swap space RAM Rule - Looks for short page residence times CPU Power Rule Scales on MP systems Looks for long run queue delays Mutex Rule - Looks for kernel lock contention and high sys CPU time TCP Rule Looks for listen queue problems Reports on connection attempt failures Solaris/Linux Performance Measurement and Tuning 4/20/2017

23 Orca Data Collector - simple, free and configurable
Comprehensive Solaris data collector based on the SE toolkit. Windows, Linux, HP-UX and AIX collectors are also available. Extendable to include application data and web servers. Central web site with graphical plots. Get it from the Orca web site; get the latest build, not the old packaged release. Widely used: 2002 and 2004 Olympic Games, many others.

24 XE Toolkit - www.xetoolkit.com
Complete re-write of the SE Toolkit by Rich Pettit. Initial release 1.0 available April 2007. Coded in Java, uses secure RMI transport, jar files. Multi-platform support: Solaris, Linux, Windows, BSD, OSX, AIX, HP-UX... Licensing: free GPL version for standard use and shared derivations; commercial support available if needed; commercial product license for custom in-house derivations. Addresses all the issues people had with the SE toolkit.

25 Processes

26 Process based data - /proc
Used by ps, proctool and debuggers, pea.se, proc(1) tools on Solaris Solaris and Linux both have /proc/pid/metric hierarchy Linux also includes system information in /proc rather than kstat Advantages The recommended and supported process access API Metric data structures reasonably stable over releases Consistent data using locking Solaris microstate data provides accurate process state timers Disadvantages High overhead for open/read/close for every process Linux reports data as ascii text, Solaris as binary structures Solaris/Linux Performance Measurement and Tuning 4/20/2017
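The per-process open/read/close overhead is easy to see if you walk /proc yourself; a small sketch for the Linux side (it assumes the standard /proc/<pid>/status layout with a VmRSS field in kB):

# Sketch: one open/read/close per process to collect RSS from Linux /proc.
import os

def linux_process_rss_kb():
    rss = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/status" % pid) as f:    # one file per process
                for line in f:
                    if line.startswith("VmRSS:"):
                        rss[int(pid)] = int(line.split()[1])   # reported in kB
                        break
        except OSError:
            continue    # process exited or access denied between listdir and open
    return rss

# Top five resident processes by RSS
print(sorted(linux_process_rss_kb().items(), key=lambda kv: -kv[1])[:5])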

27 Tracing and profiling
Tracing Tools:
truss - shows system calls made by a process
sotruss / apitrace - shows shared library calls
prex - controls TNF tracing for user and kernel code
Profiling Tools:
Compiler profile feedback using -xprofile=collect and use
Sampled profile: relink using -p and prof/gprof
Function call tree profile: recompile using -pg and gprof
Shared library call profiling: setenv LD_PROFILE and gprof
Accurate CPU timing for a process using /usr/proc/bin/ptime
Microstate process information using pea.se and pw.se (sample output columns: name lwmx pid ppid uid usr% sys% wait% chld% size rss pf, for processes such as nis_cachemgr, jre, sendmail, se.sparc, imapd and dtmail)

28 Accounting Records Standard Unix System V Accounting - acct
Tiny, incomplete (no process id!), low resolution, no overhead! Solaris Extended System and Network Accounting - exacct: flexible, overly complex, detailed data. Interval support for recording long running processes. No overhead! 100% capture ratio, even for infrequent samples!

29 Extracct for Solaris - the extracct tool gets extended accounting data out in a useful form; a description and pre-compiled code for Solaris SPARC and x86 (Solaris 8 to 10) are available online. Useful data is logged in regular columns for easy import. Includes a low overhead network accounting config file for TCP flows. Interval accounting option to force all processes to cut records. Automatic log filename generation and clean switching. Designed to run directly as a cron job, useful today. More work is needed to interface the output to the SE toolkit and Orca.

30 Example Extracct Output
# ./extracct
Usage: extracct [-vwr] [ file | -a dir ]
  -v: verbose
  -w: wracct all processes first
  -r: rotate logs
  -a dir: use acctadm.conf to get input logs, and write output files to dir
The usual way to run the command will be from cron as shown:
0 * * * * /opt/exdump/extracct -war /var/tmp/exacct > /dev/null 2>&1
2 * * * * /bin/find /var/adm/exacct -ctime +7 -exec rm {} \;
This also shows how to clean up old log files; only the binary files are deleted in this example, and /var/tmp/exacct was created to hold the text files. The process data in the text file has these columns: timestamp locltime duration procid ppid uid usr sys majf rwKB vcxK icxK sigK sycK arMB mrMB command (example rows for acctadm, sh, exdump, init and pageout).

31 What would you say if you were asked:
How busy is that system? A: I have no idea… A: 10% A: Why do you want to know? A: I’m sorry, you don’t understand your question…. Solaris/Linux Performance Measurement and Tuning 4/20/2017

32 Headroom Estimation
CPU Capacity - relatively easy to figure out
Network Usage - use bytes, not packets/s
Memory Capacity - tricky; easier in Solaris 8
Disk Capacity - can be very complex

33 Headroom Headroom is available usable resources
Total Capacity minus Peak Utilization and Margin. Applies to CPU, RAM, Net, Disk and OS. (Figure: a capacity bar labeled Utilization, Margin and Headroom.)
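A trivial sketch of that definition, with illustrative numbers:

# Sketch: headroom = total capacity - peak utilization - margin (illustrative numbers).
total_capacity = 32.0               # e.g. 32 CPUs of capacity
peak_utilization = 19.5             # peak CPU usage measured over the planning period
margin = 0.20 * total_capacity      # keep 20% in reserve for bursts and forecast error

headroom = total_capacity - peak_utilization - margin
print("usable headroom:", headroom, "CPUs")   # 6.1 CPUs left before eating into the margin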

34 Utilization Utilization is the proportion of busy time
Always defined over a time interval.

35 Response Time Response Time = Queue time + Service time
The usual assumptions: steady state averages, random arrivals, constant service time, M servers processing the same queue. Approximations: Queue length = Throughput x Response Time (Little's Law); Response Time = Service Time / (1 - Utilization^M).
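Under those assumptions the approximation can be evaluated directly; this sketch compares a single server with a 16-way system using the formula above (the 10ms service time is an arbitrary example value):

# Sketch: R = S / (1 - U**M) for one server versus a 16-way system.
service_time = 0.010    # 10ms service time

def response_time(utilization, m):
    return service_time / (1.0 - utilization ** m)

for u in (0.5, 0.7, 0.9, 0.95):
    print(u, round(response_time(u, 1), 4), round(response_time(u, 16), 4))
# The 16-way curve stays close to the service time until utilization is high,
# then degrades sharply - the shape discussed on the next slide.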

36 Response Time Curves - the traditional view of utilization as a proxy for response time. Systems with many CPUs can run at higher utilization levels, but degrade more rapidly when they run out of capacity. The headroom margin should be set according to a response time target, using R = S / (1 - U^m).

37 So what's the problem with Utilization?
Unsafe assumptions! Complex adaptive systems are not simple! Random arrivals? Bursty traffic with long tail arrival rate distribution Constant service time? Variable clock rate CPUs, inverse load dependent service time Complex transactions, request and response dependent M servers processing the same queue? Virtual servers with varying non-integral concurrency Non-identical servers or CPUs, Hyperthreading, Multicore, NUMA Measurement Errors? Mechanisms with built in bias, e.g. sampling from the scheduler clock Platform and release specific systemic changes in accounting of interrupt time Solaris/Linux Performance Measurement and Tuning 4/20/2017

38 Threaded CPU Pipelines
CPU microarchitecture optimizations Extra register sets working with one execution pipeline When the CPU stalls on a memory read, it switches registers/threads Operating system sees multiple schedulable entities (CPUs) Intel Hyperthreading Each CPU core has an extra thread to use spare cycles Typical benefit is 20%, so total capacity is 1.2 CPUs I.e. Second thread much slower when first thread is busy Hyperthreading aware optimizations in recent operating systems Sun “CoolThreads” "Niagara" SPARC CPU has eight cores, one shared floating point unit Each CPU core has four threads, but each core is a very simple design Behaves like 32 slow CPUs for integer, snail like uniprocessor for FP Overall throughput is very high, performance per watt is exceptional New Niagara 2 has dedicated FPU and 8 threads per core (total 64 threads) Solaris/Linux Performance Measurement and Tuning 4/20/2017

39 Variable Clock Rate CPUs
Laptop and other low power devices do this all the time Watch CPU usage of a video application and toggle mains/battery power…. Server CPU Power Optimization - AMD PowerNow!™ AMD Opteron server CPU detects overall utilization and reduces clock rate Actual speeds vary, but for example could reduce from 2.6GHz to 1.2GHz Changes are not understood or reported by operating system metrics Speed changes can occur every few milliseconds (thermal shock issues) Dual core speed varies per socket, Quad core varies per core Quad core can dynamically stop entire cores to save power Possible scenario: You estimate 20% utilization at 2.6GHz You see 45% reported in practice (at 1.2GHz) Load doubles, reported utilization drops to 40% (at 2.6GHz) Actual mapping of utilization to clock rate is unknown at this point Note: Older and "low power" Opterons used in blades fix clock rate Solaris/Linux Performance Measurement and Tuning 4/20/2017
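One way to keep utilization comparable across clock-rate changes is to scale it into absolute work done; the sketch below assumes the effective clock rate per interval can be obtained from somewhere, since, as noted above, the standard OS metrics do not report it:

# Sketch: normalize reported utilization by the clock rate in effect, so power
# management does not distort the trend. clock_ghz is assumed to come from an
# external source - the usual utilization metrics do not expose it.
max_clock_ghz = 2.6

def normalized_util(reported_util, clock_ghz):
    return reported_util * clock_ghz / max_clock_ghz   # fraction of full-speed capacity

# The scenario above: 45% reported at 1.2GHz is really ~21% of the socket,
# while 40% reported at 2.6GHz is 40% - so the load actually doubled.
print(normalized_util(0.45, 1.2), normalized_util(0.40, 2.6))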

40 Virtual Machine Monitors
VMware, Xen, IBM LPARs etc. provide non-integral and non-constant fractions of a machine. Naive operating systems and applications don't expect this behavior; however, there has been a lot of recent tools development from vendors. The average CPU count must be reported for each measurement interval. VMM overhead varies, and application scaling characteristics may be affected.

41 Measurement Errors Mechanisms with built in bias
e.g. sampling from the scheduler clock underestimates CPU usage Solaris 9 and before, Linux, AIX, HP-UX “sampled CPU time” Solaris 10 and HP-UX “measured CPU time” far more accurate Solaris microstate process accounting always accurate but in Solaris 10 microstates are also used to generate system-wide CPU Accounting of interrupt time Platform and release specific systemic changes Solaris 8 - sampled interrupt time spread over usr/sys/idle Solaris 9 - sampled interrupt time accumulated into sys only Solaris 10 - accurate interrupt time spread over usr/sys/idle Solaris 10 Update 1 - accurate interrupt time in sys only Solaris/Linux Performance Measurement and Tuning 4/20/2017

42 Storage Utilization - storage virtualization broke utilization metrics a long time ago. The host server measures busy time on a "disk". For a simple disk, "single server" response time gets high near 100% utilization. A cached RAID LUN can report 100% utilization from one I/O stream, yet at full capacity it supports many threads of I/O, since there are many disks and RAM buffering behind it. New metric - "Capability Utilization": adjusted to report the proportion of actual capacity for the current workload mix; measured by tools such as Ortera Atlas.

43 How to plot Headroom Measure and report absolute CPU power if you can get it… Plot shows headroom in blue, margin in red, total power tracking day/night workload variation, plotted as mean + two standard deviations. Solaris/Linux Performance Measurement and Tuning 4/20/2017

44 “Cockcroft Headroom Plot”
Scatter plot of response time (ms) vs. throughput (KB) from iostat metrics, with histograms on the axes and a throughput time series plot. Shows the distributions and the shape of the response time. Fits a throughput-weighted inverse Gaussian curve. Coded using the "R" statistics package; development was blogged.

45 Response Time vs. Throughput
A different problem: a thread-limited appserver where CPU utilization is low. Measurements are of a single SOA service pool; response is in milliseconds, throughput is executions/s. (The slide shows an R summary() of the Exec and Resp columns: Min., 1st Qu., Median, Mean, 3rd Qu., Max.)

46 How busy is that system again?
Check your assumptions… Record and plot absolute capacity for each measurement interval Plot response time as a function of throughput, not just utilization SOA response characteristics are complicated… More detailed discussion in CMG06 Paper and blog entries “Utilization is Virtually Useless as a Metric” - Adrian Cockcroft - CMG06 Solaris/Linux Performance Measurement and Tuning 4/20/2017

47 CPU

48 CPU Capacity Measurements
CPU Capacity is defined by CPU type and clock rate, or a benchmark rating like SPECrateInt2000 CPU throughput - CPU scheduler transaction rate measured as the number of voluntary context switches CPU Queue length CPU load average gives an approximation via a time decayed average of number of jobs running and ready to run CPU response time Solaris microstate accounting measures scheduling delay CPU utilization Defined as busy time divided by elapsed time for each CPU Badly distorted and undermined by virtualization…… Solaris/Linux Performance Measurement and Tuning 4/20/2017

49 CPU time measurements Biased sample CPU measurements
See the 1998 paper "Unix CPU Time Measurement Errors". Microstate measurements are accurate, but are platform and tool specific; sampled metrics are more inaccurate at low utilization. CPU time is sampled by the 100Hz clock interrupt - sampling theory says this is accurate for an unbiased sample, but the sample is very biased, as the clock also schedules the CPU: daemons that wake up on the clock timer can hide in the gaps, and the problem gets worse as the CPU gets faster. Increase the clock interrupt rate? On Solaris, set hires_tick=1 sets the rate to 1000Hz, which is good for realtime wakeups and makes it harder to hide CPU usage, but has slightly higher overhead. Use measured CPU time at the per-process level: microstate accounting takes a timestamp on each state change, is very accurate and also provides extra information, but still doesn't allow for interrupt overhead. prstat -m and the pea.se command use this accurate measurement.
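The bias is easy to reproduce with a toy simulation (synthetic numbers, not real measurements): a daemon that wakes on each clock tick and runs for 2ms is never on-CPU when the 100Hz sampler fires, so sampled accounting charges it nothing while timestamp-based measurement sees 20% of a CPU.

# Sketch: toy simulation of 100Hz sampled CPU accounting versus measured CPU time.
tick = 0.010       # 100Hz clock: one sample every 10ms
burst = 0.002      # daemon wakes on the tick and runs for 2ms, then sleeps
duration = 10.0    # simulate 10 seconds

ticks = int(duration / tick)
measured_cpu = ticks * burst      # timestamp-based accounting: 2.0s of CPU time
sampled_cpu = 0.0
for i in range(ticks):
    sample_time = (i + 1) * tick                          # sampler fires at the next tick
    on_cpu = i * tick <= sample_time < i * tick + burst   # was the daemon running then?
    if on_cpu:
        sampled_cpu += tick                               # sampler charges a whole tick

print("measured %.1fs, sampled %.1fs" % (measured_cpu, sampled_cpu))
# measured 2.0s, sampled 0.0s: 20% CPU usage completely hidden from sampled metrics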

50 More CPU Measurement Issues
Platform and release specific details Are interrupts included in system time? It depends… Is vmstat CPU sampled (Linux) or measured (Solaris 10)? Load average includes CPU queue (Solaris) or CPU+Disk (Linux) Wait for I/O is a misleading subset of idle time, metric removed in Solaris 10, ignore it in all other Unix/Linux releases Solaris/Linux Performance Measurement and Tuning 4/20/2017

51 Controlling and Monitoring CPUs in Solaris
psrinfo - show CPU status and clock rate
corestat - show internal behavior of multi-core CPUs
psradm - enable/disable CPUs
pbind - bind a process to a CPU
psrset - create sets of CPUs to partition a system; at least one CPU must remain in the default set, to run kernel services like NFS threads; all CPUs still take interrupts from their assigned sources; processes can be bound to sets
mpstat shows per-CPU counters (per set in Solaris 9), with columns: CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl

52 Monitoring CPU mutex lock statistics
To fix mutex contention, change the application workload or upgrade to a newer OS release - locking strategies are too complex to be patched. The lockstat command is very powerful and easy to use; Solaris 8 extends lockstat to include kernel CPU time profiling. It dynamically changes all locks to be instrumented and displays lots of useful data about which locks are contending. Example: "# lockstat sleep 5" reports "Adaptive mutex spin: 3318 events" with columns Count, indv, cuml, rcnt, spin, Lock, Caller; the top contending callers in the example were cleanlocks+0x10, dev_get_dev_info+0x4c, mod_rele_dev_by_major+0x2c, cdev_size+0x74 and ddi_prop_search_common+0x50 (18% to 47% cumulative).

53 Network

54 Network protocol data - based on a streams module interface in Solaris. The Solaris 2 ndd interface is used to configure protocols and interfaces; the Solaris 2 mib interface is used by netstat -s and snmpd to get TCP stats etc. Advantages: individual named metrics are reasonably stable over releases; consistent data using locking; extensible to add metrics without breaking existing code; Solaris ndd can retune TCP online without a reboot; system data is often also made available via the SNMP protocol. Disadvantages: the underlying API is not supported, so SNMP access is preferred.

55 Network interface and NFS metrics
Network interface throughput counters from kstat:
rbytes, obytes — read and output byte counts
multircv, multixmt — multicast byte counts
brdcstrcv, brdcstxmt — broadcast byte counts
norcvbuf, noxmtbuf — buffer allocation failure counts
NFS client statistics are shown in iostat on Solaris: "crun% iostat -xnP" prints extended device statistics (r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device) for NFS mounts such as crun:vold(pid363), servdist:/usr/dist, servhome:/export/home/adrianc and servhome:/var/mail, alongside local disks like c0t2d0s0 and c0t2d0s2.

56 How NFS Works Showing the many layers of caching involved

57 Network Capacity Measurements
Network Interface Throughput Byte and packet rates input and output TCP Protocol Specific Throughput TCP connection count and connection rates TCP byte rates input and output NFS/SMB Protocol Specific Throughput Byte rates read and write NFS/SMB service response times HTTP Protocol Specific Throughput HTTP operation rates Get and post payload byte rates and size distribution Solaris/Linux Performance Measurement and Tuning 4/20/2017

58 TCP - A Simple Approach Capacity and Throughput Metrics to Watch
Connections Current number of established connections New outgoing connection rate (active opens) Outgoing connection attempt failure rate New incoming connection rate (passive opens) Incoming connection attempt failure rate (resets) Throughput Input and output byte rates Input and output segment rates Output byte retransmit percentage Solaris/Linux Performance Measurement and Tuning 4/20/2017

59 Obtaining Measurements
Get the TCP MIB via SNMP or netstat -s Standard TCP metric names: tcpCurrEstab: current number of established connections tcpActiveOpens: number of outgoing connections since boot tcpAttemptFails: number of outgoing failures since boot tcpPassiveOpens: number of incoming connections since boot tcpOutRsts: number of resets sent to reject connection tcpEstabResets: resets sent to terminate established connections (tcpOutRsts - tcpEstabResets): incoming connection failures tcpOutDataSegs, tcpInDataSegs: data transfer in segments tcpRetransSegs: retransmitted segments Solaris/Linux Performance Measurement and Tuning 4/20/2017
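Because these are cumulative counters since boot, the interesting numbers come from differencing two snapshots over an interval; a sketch (the snapshot dictionaries are placeholders to be filled by scraping netstat -s or an SNMP walk of the TCP MIB):

# Sketch: turn two snapshots of cumulative TCP MIB counters into rates.
def tcp_rates(prev, curr, interval_s):
    d = lambda name: (curr[name] - prev[name]) / interval_s
    out_segs = d("tcpOutDataSegs")
    return {
        "active_opens/s":   d("tcpActiveOpens"),     # new outgoing connections
        "passive_opens/s":  d("tcpPassiveOpens"),    # new incoming connections
        "attempt_fails/s":  d("tcpAttemptFails"),    # outgoing connection failures
        "incoming_fails/s": d("tcpOutRsts") - d("tcpEstabResets"),
        "retransmit_pct":   100.0 * d("tcpRetransSegs") / out_segs if out_segs else 0.0,
        "established":      curr["tcpCurrEstab"],    # a level, not a rate
    }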

60 Internet Server Issues
TCP Connections are expensive TCP is optimized for reliable data on long lived connections Making a connection uses a lot more CPU than moving data Connection setup handshake involves several round trip delays Each open connection consumes about 1 KB plus data buffers Pending connections cause “listen queue” issues Each new connection goes through a “slow start” ramp up Other TCP Issues TCP windows can limit high latency high speed links Lost or delayed data causes time-outs and retransmissions Solaris/Linux Performance Measurement and Tuning 4/20/2017

61 TCP Sequence Diagram for HTTP Get

62 Stalled HTTP Get and Persistent HTTP

63 Memory

64 Memory Capacity Measurements
Physical Memory Capacity Utilization and Limits Kernel memory, Shared Memory segment Executable code, stack and heap File system cache usage, Unused free memory Virtual Memory Capacity - Paging/Swap Space When there is no more available swap, Unix stops working Memory Throughput Hardware counter metrics can track CPU to Memory traffic Page in and page out rates Memory Response Time Platform specific hardware memory latency makes a difference, but hard to measure Time spent waiting for page-in is part of Solaris microstate accounting Solaris/Linux Performance Measurement and Tuning 4/20/2017

65 Page Size Optimization
Systems may support large pages for reduced overhead Solaris support is more dynamic/flexible than Linux at present Intimate Shared Memory locks large pages in RAM No swap space reservation Used for large database server Shared Global Area No good metrics to track usage and fragmentation issues Solaris ppgsz command can set heap and stack pagesize SPARC Architecture Base page size is 8KB, Large pages are 4MB Intel/AMD x86 Architectures Base page size is 4KB, Large pages are 2MB Solaris/Linux Performance Measurement and Tuning 4/20/2017

66 Cache principles Temporal locality - “close in time”
If you need something frequently, keep it near you; if you don't use it for a while, put it back; if you change it, save the change by putting it back. Spatial locality - "close in space, nearby": if you go to get one thing, get other stuff that is nearby; you may save a trip by prefetching things; you can waste bandwidth if you fetch too much that you don't use. Caches work well with randomness - randomness prevents worst case behaviour, while deterministic patterns often cause cache-busting accesses. Very careful cache-friendly tuning can give great speedups.

67 The memory go round - Unix/Linux
Memory usage flows between subsystems. Memory is one of the main resources in the system. There are four main consumers of memory, and when memory is needed, they all obtain it from the free list. The figure shows these consumers and how they relate to the free list.

When memory is needed, it is taken from the head of the free list. When memory is put back on the free list, there are two choices. If the page still contains valid data, it is put on the tail of the list so it will not be reused for as long as possible. If the page has no useful content, it is put at the head of the list for immediate reuse. The kernel keeps track of valid pages in the free list so that they can be reclaimed if their content is requested, thereby saving a disk I/O.

The vmstat reclaim counter is two-edged. On one hand, it is good that a page fault was serviced by a reclaim, rather than a page-in that would cause a disk read. On the other hand, you don't want active pages to be stolen and end up on the free list in the first place. The vmstat free value is simply the size of the free list, in Kbytes. The way the size varies is what tends to confuse people.

The most important value reported by vmstat is the scan rate - sr. If it is zero or close to zero, then you can be sure that the system does have sufficient memory. If it is always high (hundreds to thousands of pages/second), then adding more memory is likely to help.

68 The memory go round - Solaris 8 and Later
Memory usage flows between subsystems

69 Swap space Swap is very confusing and badly instrumented!
# se swap.se reports the kernel's underlying values: ani_max, ani_resv, ani_free, availrmem, swapfs_minfree, ramres, swap_resv, swap_alloc, swap_avail and swap_free.
Misleading data printed by swap -s: "allocated + reserved = used, available".
Corrected labels: "allocated + unallocated = reserved, available".
sar -r mislabels its freeswap column (it is really swap available), reported in blocks.
Useful swap data: Total swap 520 M = available 369 M + reserved 151 M, made up of Total disk 428 M + Total RAM 92 M.

Swap space is really a misnomer for what is actually the paging space. Almost all the accesses are page related rather than whole-process swapping. Swap space is allocated from spare RAM and from swap disk. The measures provided are based on two sets of underlying numbers: one set relates to physical swap disk, the other relates to RAM used as swap space by pages in memory.

Swap space is used in two stages. When memory is requested (for example, via a malloc call) swap is reserved and a mapping is made against the /dev/zero device. Reservations are made against available disk-based swap to start with; when that is all gone, RAM is reserved instead. When these pages are first accessed, physical pages are obtained from the free list and filled with zeros, and pages of swap become allocated rather than reserved. In effect, reservations are initially taken out of disk-based swap, but allocations are initially taken out of RAM-based swap. When a page of anonymous RAM is stolen by the page scanner, the data is written out to the swap space, i.e., the swap allocation is moved from memory to disk, and the memory is freed.

Memory space that is mapped but never used stays in the reserved state, and the reservation consumes swap space. This behavior is common for large database systems and is the reason why large amounts of swap disk must be configured to run applications like Oracle and SAP R/3, even though they are unlikely to allocate all the reserved space.

The first swap partition allocated is also used as the system dump space to store a kernel crash dump. It is a good idea to have plenty of disk space set aside in /var/crash and to enable savecore by uncommenting the commands in /etc/rc2.d/S20sysetup. If you forget and think that there may have been an unsaved crash dump, you can try running savecore long after the system has rebooted. The crash dump is stored at the very end of the swap partition, and the savecore command can tell if it has been overwritten yet.
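A small sketch of the reservation arithmetic, using the numbers from the "useful swap data" line above:

# Sketch: swap bookkeeping - reservation versus allocation (numbers from the slide).
disk_swap_mb = 428      # physical swap devices/files
ram_swap_mb = 92        # spare RAM usable as swap backing
total_swap_mb = disk_swap_mb + ram_swap_mb     # 520 MB

reserved_mb = 151       # memory mapped (e.g. malloc'd) but possibly never touched
available_mb = total_swap_mb - reserved_mb     # 369 MB still reservable

# Reservations come out of disk-based swap first, then RAM; when available
# reaches zero, further memory allocations fail even if free RAM remains.
print(total_swap_mb, available_mb)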

70 Disk

71 Disk Capacity Measurements
Detailed metrics vary by platform Easy for the simple disk cases Hard for cached RAID subsystems Almost Impossible for shared disk subsystems and SANs Another system or volume can be sharing a backend spindle, when it gets busy your own volume can saturate, even though you did not change your own workload! Solaris/Linux Performance Measurement and Tuning 4/20/2017

72 Solaris Filesystem issues
ufs - standard, reliable, good for lots of small files ufs with transaction log - faster writes and recovery tmpfs - fastest if you have enough RAM, volatile NFS NFS2 - safe and common, 8KB blocks, slow writes NFS3 - more readahead and writebehind, faster default 32KB block size - fast sequential, may be slow random default TCP instead of UDP, more robust over WAN NFS4 - adds stateful behavior cachefs - good for read-mostly NFS speedup Veritas VxFS - useful on old Solaris releases Solaris 8 UFS Upgrade ufs was extended to be more competitive with VxFS transaction log unbuffered direct access option and snapshot backup capability now available “for free” with Solaris 8 Solaris/Linux Performance Measurement and Tuning 4/20/2017

73 Solaris 10 ZFS - What it doesn't have....
Nice features:
No extra cost - it's bundled in a free OS
No volume manager - it's built in
No space management - file systems use a common pool
No long wait for newfs to finish - create a 3TB file system in a second
No fsck - its transactional commit means it's consistent on disk
No slow writes - disk write caches are enabled and flushed reliably
No random or small writes - all writes are large batched sequential
No rsync - snapshots can be differenced and replicated remotely
No silent data corruption - all data is checksummed as it is read
No bad archives - all the data in the file system is scrubbed regularly
No penalty for software RAID - RAID-Z has a clever optimization
No downtime - mirroring, RAID-Z and hot spares
No immediate maintenance - double parity disks if you need them
Wish-list:
No way to know how much performance headroom you have!
No clustering support

74 Linux Filesystems - there are a large number of options!
EXT3 - Common default for many Linux distributions. Efficient for CPU and space, small block size, relatively simple for reliability and recovery. Journalling support options can improve performance. EXT4 is in development.
XFS - Based on Silicon Graphics XFS, mature and reliable. Better for large files and streaming throughput. High Performance Computing heritage.

75 Disk Configurations Sequential access is ~10 times faster than random
Sequential rates are now about MB/s per disk Random rates are 166 operations/sec, (250/sec at 15000rpm) The size of each random read should be as big as possible Reads should be cached in main memory “The only good fast read is the one you didn’t have to do” Database shared memory or filesystem cache is microseconds Disk subsystem cache is milliseconds, plus extra CPU load Underlying disk is ~6ms, as its unlikely that data is in cache Writes should be cached in nonvolatile storage Allows write cancellation and coalescing optimizations NVRAM inside the system - Direct access to Flash storage Solid State Disks based on Flash are the "Next Big Thing" Solaris/Linux Performance Measurement and Tuning 4/20/2017

76 Slow idle disks explained
(iostat extended disk statistics example: columns disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b for two sd devices.) Why do these disks have high svc_t when they are idle? Use prex to turn on kernel TNF probes for disk I/O: sdstrategy is called when an I/O is started, biodone is called when it completes; match the pairs of TNF records to see the time sequences. We find a burst of writes from pid 3 every 30s - fsflush is updating inodes scattered all over the filesystem. All writes are issued back to back without waiting to complete, so a long queue forms; each write takes on average ~10ms to service, but the response (svc_t) includes a long queue time. Typically 20 or so writes every 30s shows as 0% busy but with a large svc_t.

77 Disk Throughput

78 Max and Avg Disk Utilization (Same data)

79 Data from iostat - what can we see here?
(iostat extended disk statistics for about twenty sd devices, columns disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b. The devices are annotated as: sd7 root ufs, solid state disks, a stripe, an 8K RR stripe, and a cached write log.)

80 Simple Disks Utilization shows capacity usage Response time is svc_t
Utilization is measured using iostat %b. Response time is svc_t; svc_t increases due to waiting in the queues caused by bursty loads. Service time per I/O is Util/IOPS - calculate it as (%b/100)/(r/s + w/s); it decreases due to optimization of queued requests as load increases.

81 Single Disk Parameters
e.g. Seagate 18GB ST318203FC (parameters from the vendor's data sheet). Rotation: 10,000 RPM = 6.0ms per revolution = 166/s. Avg read seek = 5.2ms, avg write seek = 6.0ms, avg transfer rate = 24.5 MB/s. Random IOPS: approximately 166/s for small requests, approximately 24.5 MB/s divided by the request size for large requests.
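Those parameters give a simple per-disk random throughput model, sketched here:

# Sketch: approximate random IOPS for one disk from the parameters above.
def disk_random_iops(request_kb, ops_limit=166.0, transfer_mb_s=24.5):
    transfer_limited = (transfer_mb_s * 1024.0) / request_kb   # bandwidth-bound rate
    return min(ops_limit, transfer_limited)                    # seek/rotation-bound otherwise

for kb in (2, 8, 64, 256, 1024):
    print(kb, "KB ->", round(disk_random_iops(kb), 1), "IOPS")
# 2-8KB requests are ops-limited (~166/s); 1MB requests are bandwidth-limited (~24.5/s)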

82 Mirrored Disks All writes go to both disks Read policy alternatives
All reads from one side Alternate from side to side Split by block number to reduce seek Read both and use first to respond Simple Capacity Assumption Assume duplicated interconnects Same capacity as unmirrored Solaris/Linux Performance Measurement and Tuning 4/20/2017

83 Concatenated and Fat Stripe Disks
Request size less than interlace Requests go to one disk Single threaded requests Same capacity as single disk Multithreaded requests Same service time as one disk Throughput of N disks if more than N threads are evenly distributed Solaris/Linux Performance Measurement and Tuning 4/20/2017

84 Striped Disks Request size more than interlace
Requests split over N disks Single and multithreaded requests N = request size / interlace Throughput of N disks Service Time Reduction Reduced size of request reduces service time for large transfers Need to wait for all disks to complete - slowest dominates Solaris/Linux Performance Measurement and Tuning 4/20/2017

85 RAID5 for Small Requests
Writes must calculate parity: read the parity and old data blocks, calculate the new parity, then write the log, data and parity - triple the service time and one third the throughput of one disk. Reads perform like a stripe: throughput of N-1 disks, service time of one. Degraded mode throughput is about that of one disk. (The diagram shows the recovery log disk alongside the data disks.)

86 RAID5 for Large Requests
Write the full stripe and parity. Capacity is similar to a stripe, with similar read and write performance: throughput of N-1 disks, service time for the size reduced by N-1, and less interconnect load than a mirror. Degraded mode: throughput halved and service similar, with extra CPU used to regenerate data.

87 Cached RAID5 Nonvolatile cache Fast service time for writes
No need for recovery log disk Fast service time for writes Interconnect transfer time only Cache optimizes RAID5 Makes all backend writes full stripe Solaris/Linux Performance Measurement and Tuning 4/20/2017

88 Cached Stripe Write caching for stripes Optimizations
Greatly reduced service time Very worthwhile for small transfers Large transfers should not be cached In many cases, 128KB is crossover point from small to large Optimizations Rewriting same block cancels in cache Small sequential writes coalesce Solaris/Linux Performance Measurement and Tuning 4/20/2017

89 Capacity Model Measurements
Derived from iostat extended disk statistics output (columns disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b for an sd device):
Utilization U = %b / 100 = 0.27
Throughput X = r/s + w/s = 41.8
Size K = (Kr/s + Kw/s) / X = 8.2 KB
Concurrency N = actv = 2.3
Service time S = U / X = 6.5ms
Response time R = svc_t = 15.8ms
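The same derivation as a small sketch that can be applied to any iostat sample; the r/w and Kr/Kw splits below are illustrative values chosen so that the totals match the example (X = 41.8, %b = 27):

# Sketch: derive the capacity model terms from one iostat sample.
def iostat_model(r_s, w_s, kr_s, kw_s, actv, svc_t_ms, pct_b):
    X = r_s + w_s                   # throughput, I/Os per second
    U = pct_b / 100.0               # utilization
    K = (kr_s + kw_s) / X           # mean I/O size, KB
    S = U / X * 1000.0              # service time per I/O, ms
    N = actv                        # mean number of active requests
    R = svc_t_ms                    # response time including queueing, ms
    return dict(U=U, X=X, K=K, N=N, S=S, R=R)

# Reproduces the example above: U=0.27, X=41.8, K=8.2KB, N=2.3, S~6.5ms, R=15.8ms
print(iostat_model(r_s=28.2, w_s=13.6, kr_s=271.8, kw_s=71.0,
                   actv=2.3, svc_t_ms=15.8, pct_b=27))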

90 Cache Throughput Hard to model clustering and write cancellation improvements Make pessimistic assumption that throughput is unchanged Primary benefit of cache is fast response time Writes can flood cache and saturate back-end disks Service times suddenly go from 3ms to 300ms Very hard to figure out when this will happen Paranoia is a good policy…. Solaris/Linux Performance Measurement and Tuning 4/20/2017

91 Concluding Summary Walk out of here with the most useful content fresh in your mind!

92 Quick Tips #1 - Disk The system will usually have a disk bottleneck
Track how busy the busiest disk of all is. Look for unbalanced, busy or slow disks with iostat. Options: timestamp, look for busy controllers, ignore idle disks: "% iostat -xnzCM -T d 30" prints timestamped extended device statistics (r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device) per controller (c0) and per disk (c0t0d0, c0t1d0). Watch out for sd_max_throttle limiting throughput when set too low. Watch out for the RAID cache being flooded on writes; this causes a sudden, very large increase in write service time.

93 Quick Tips #2 - Network If you ever see a slow machine that also appears to be idle, you should suspect a network lookup problem. i.e. the system is waiting for some other system to respond. Poor Network Filesystem response times may be hard to see Use iostat -xn 30 on a Solaris client wsvc_t is the time spent in the client waiting to send a request asvc_t is the time spent in the server responding %b will show 100% whenever any requests are being processed, it does NOT mean that the network server is maxed out, as an NFS server is a complex system that can serve many requests at once. Name server delays are also hard to detect Overloaded LDAP or NIS servers can cause problems DNS configuration errors or server problems often cause 30s delays as the request times out Solaris/Linux Performance Measurement and Tuning 4/20/2017

94 Quick Tips #3 - Memory Avoid the common vmstat misconceptions
The first line is the average since boot, so ignore it. Linux, other Unix and earlier Solaris releases: ignore "free" memory; use high page scanner "sr" activity as your RAM shortage indicator. Solaris 8 and later releases: use "free" memory to see how much is left for code to use; use non-zero page scanner "sr" activity as your RAM shortage indicator. Don't panic when you see page-ins and page-outs in vmstat - normal filesystem activity uses paging. (Example: "solaris9% vmstat 30", with column groups kthr r b w, memory swap free, page re mf pi po fr de sr, disk f0 s0 s1 s6, faults in sy cs, cpu us sy id.)
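A sketch of automating the scan-rate check (it assumes a vmstat header that names an "sr" column, as on Solaris; column positions differ between releases, so the index is found from the header rather than hard-coded):

# Sketch: watch the vmstat page scanner rate ("sr") as the RAM shortage indicator.
import subprocess

proc = subprocess.Popen(["vmstat", "30"], stdout=subprocess.PIPE, text=True)
sr_col = None
samples = 0
for line in proc.stdout:
    fields = line.split()
    if not fields:
        continue
    if "sr" in fields and not fields[0].isdigit():
        sr_col = fields.index("sr")       # header line names the columns
        continue
    if sr_col is None or not fields[0].isdigit():
        continue                          # skip banners and other header lines
    samples += 1
    if samples == 1:
        continue                          # first sample is the average since boot
    sr = int(fields[sr_col])
    if sr > 0:
        print("page scanner active, sr =", sr, "- possible RAM shortage")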

95 Quick Tips #4 - CPU Look for a long run queue (vmstat procs r) - and add CPUs To speedup with a zero run queue you need faster CPUs, not more of them Check for CPU system time dominating user time Most systems should have lots more Usr than Sys, as they are running application code But... dedicated NFS servers should be 100% Sys And... dedicated web servers have high Sys as well So... assume that lots of network service drives Sys time Watch out for processes that hog the CPU Big problem on user desktop systems - look for looping web browsers Web search engines may get queries that loop Use resource management or limit cputime (ulimit -t) in startup scripts to terminate web queries Solaris/Linux Performance Measurement and Tuning 4/20/2017

96 Quick Tips #5 - I/O Wait Look for processes blocked waiting for disk I/O (vmstat procs b) This is what causes CPU time to be counted as wait not idle Nothing else ever causes CPU wait time! CPU wait time is a subset of idle time, consumes no resources CPU wait time is not calculated properly on multiprocessor machines on older Solaris releases, it is greatly inflated! CPU wait time is no longer calculated, zero in Solaris 10 Bottom line - don’t worry about CPU wait time, it’s a broken metric Look at individual process wait time using microstates prstat -m or SE toolkit process monitoring Look at I/O wait time using iostat asvc_t Solaris/Linux Performance Measurement and Tuning 4/20/2017

97 Quick Tips #6 - iostat For Solaris remember “expenses” iostat -xPncez 30 Add -M for Megabytes, and -T d for timestamped logging Use 30 second interval to avoid spikes in load. Watch asvc_t which is the response time for Solaris Look for regular disks over 5% busy that have response times of more than 10ms as a problem. If you have cached hardware RAID, look for response times of more than 5ms as a problem. Ignore large response times on idle disks that have filesystems - its not a problem and the cause is the fsflush process Solaris/Linux Performance Measurement and Tuning 4/20/2017

98 Recipe to fix a slow system
Essential Background Information What is the business function of the system? Who and where are the users? Who says there is a problem, and what is slow? What changed recently and what is on the way? What is the system configuration? CPU/RAM/Disk/Net/OS/Patches, what application software is in use? What are the busy processes on the system doing? use top, prstat, pea.se or /usr/ucb/ps uax | head Report CPU and disk utilization levels, iostat -xPncezM -T d 30 What is making the disks busy? What is the network name service configuration? How much network activity is there? Use netstat -i 30 or nx.se 30 Is there enough memory? Check free memory and the scan rate with vmstat 30 Solaris/Linux Performance Measurement and Tuning 4/20/2017

99 Further Reading - Books
General Solaris/Unix/Linux Performance Tuning System Performance Tuning (2nd Edition) by Gian-Paolo D. Musumeci and Mike Loukides; O'Reilly & Associates Solaris Performance Tuning Books Solaris Performance and Tools, Richard McDougall, Jim Mauro, Brendan Gregg; Prentice Hall Configuring and Tuning Databases on the Solaris Platform, Allan Packer; Prentice Hall Sun Performance and Tuning, by Adrian Cockcroft and Rich Pettit; Prentice Hall Sun BluePrints™ Capacity Planning for Internet Services, Adrian Cockcroft and Bill Walker; Prentice Hall Resource Management, Richard McDougall, Adrian Cockcroft et al. Prentice Hall Linux Linux Performance Tuning and Capacity Planning by Jason R. Fink and Matthew D. Sherer Google has a Linux specific search mode Solaris/Linux Performance Measurement and Tuning 4/20/2017

100 Questions? (The End)

