
1 Performance Evaluations An overview and some lessons learned “Knowing is not enough; we must apply. Willing is not enough; we must do.” Johann Wolfgang von Goethe

2 Performance Evaluation
Research methodology:
– quantitative evaluation
– of an existing software or hardware artefact, the SUT ('system under test')
– using a set of experiments
– each consisting of a series of performance measurements
– to collect realistic performance data

3 Performance Evaluations
Prepare – What do you want to measure?
Plan – How can you measure it, and what is needed?
Implement – Pitfalls in the correct implementation of benchmarks.
Evaluate – How to conduct performance experiments.
Analyse and Visualise – What does the outcome mean?

4 Step 1: Preparations
What do you want to show?
– Understanding of the behaviour of an existing system?
– Or proof of an approach's superiority?
Which performance metrics are you interested in?
– Mean response times
– Throughput
– Scalability
– …
What are the variables?
– Parameters of your own system / algorithm
– Number of concurrent users, number of nodes, …

5 Performance Metrics
Response Time (rt)
– Time duration the SUT takes to answer a request
– Note: For complex tasks, response time ≠ runtime
Mean Response Time (mrt)
– Mean of the response times of a set of requests
– Only successful requests count!
Throughput (thp)
– Number of successful requests per time unit
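As an illustration of these definitions, here is a minimal Java sketch (not from the lecture; RequestResult and Metrics are assumed, illustrative names) that computes the mean response time over successful requests only, and the throughput for a given measurement interval:

    import java.util.List;

    class RequestResult {
        final boolean success;
        final long responseTimeMillis;
        RequestResult(boolean success, long responseTimeMillis) {
            this.success = success;
            this.responseTimeMillis = responseTimeMillis;
        }
    }

    class Metrics {
        // mrt: mean over the response times of *successful* requests only
        static double meanResponseTime(List<RequestResult> results) {
            long sum = 0;
            int n = 0;
            for (RequestResult r : results) {
                if (r.success) { sum += r.responseTimeMillis; n++; }
            }
            return n == 0 ? Double.NaN : (double) sum / n;
        }

        // thp: number of successful requests per second of the measurement interval
        static double throughput(List<RequestResult> results, long intervalMillis) {
            int ok = 0;
            for (RequestResult r : results) {
                if (r.success) ok++;
            }
            return ok / (intervalMillis / 1000.0);
        }
    }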

6 Runtime versus Response Time
Model: client-server communication with the server queueing incoming requests (e.g. web servers or database servers); the client sends a request a.
– runt(a) - runtime to complete request a: runt(a) = t_receiveN(last_result) - t_send
– rt(a) - response time for action a (until the first result comes back to the client!): rt(a) = t_receive1(first_result) - t_send
– wt(a) - waiting time of action a in the server queue
– et(a) - execution time of action a (after the server took the request from the queue)
– frt(a) - first result time of action a (Note: frt(a) <= et(a))
– tc(a) - network transmission times for the request a and its result(s)
[Figure: timeline from t_send to t_receiveN showing the network transmission tc(a) and tc(result), queueing wt(a), execution et(a), first-result time frt(a) at t_receive1, and the resulting rt(a) and runt(a) spans.]
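One way to see the difference in code is the following sketch; it assumes a client API that delivers results incrementally through an Iterator (a stand-in, not any particular product's interface) and records both rt(a) and runt(a) for a single request:

    import java.util.Iterator;

    class TimedRequest {
        // Response time ends at the first result, runtime at the last one.
        static void measure(Iterator<byte[]> results) {
            long tSend = System.nanoTime();
            long tFirst = -1, tLast = -1;
            while (results.hasNext()) {
                results.next();
                long now = System.nanoTime();
                if (tFirst < 0) tFirst = now;   // first result received
                tLast = now;                    // keeps moving until the last result
            }
            double rtMillis   = (tFirst - tSend) / 1e6;  // rt(a)
            double runtMillis = (tLast - tSend) / 1e6;   // runt(a)
            System.out.printf("rt=%.3f ms, runt=%.3f ms%n", rtMillis, runtMillis);
        }
    }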

7 More Performance Metrics
Scalability / Speedup: thp_n / thp_1
Fairness
Resource Consumption
– memory usage
– CPU load
– energy consumption
– …
Note: The latter are typically server-side statistics!

8 Step 2: Planning
Which experiments will show the intended results?
What do you need to run those experiments?
– Hardware
– Software
– Data (!)
Prepare an evaluation schedule
– Evaluations always take longer than you expect!
– Expect to change some goals / your approach based on the outcome of initial experiments
– Some initial runs might be helpful to explore the space

9 Typical Client-Server Evaluation Setup
[Figure: client emulator(s) connected via a test network to the System Under Test (SUT); the servers may additionally use a separate server-to-server network.]
– Often just one multithreaded client that emulates n concurrent clients (see the sketch below)
– In general, the SUT can be arbitrarily complex, e.g. clustered servers or multi-tier architectures
– Response time and throughput are measured at the client emulator
– The client emulator(s) should run on a separate machine from the server(s).
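The sketch referred to above: a hypothetical multithreaded client emulator in Java. The Sut interface is an assumed placeholder for whatever protocol the real system under test speaks; per-thread result lists avoid locking on a shared structure during the measurement.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    interface Sut { void execute(String request) throws Exception; }

    class ClientEmulator {
        static List<List<Long>> run(final Sut sut, int clients, final int requestsPerClient)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(clients);
            List<List<Long>> perClient = new ArrayList<List<Long>>();
            for (int c = 0; c < clients; c++) {
                final List<Long> times = new ArrayList<Long>();
                perClient.add(times);
                pool.submit(new Runnable() {
                    public void run() {
                        for (int i = 0; i < requestsPerClient; i++) {
                            long t0 = System.nanoTime();
                            try { sut.execute("some request"); } catch (Exception ignored) { }
                            times.add(System.nanoTime() - t0);   // client-side response time in ns
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            return perClient;   // one list of response times per emulated client
        }
    }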

10 Workload Specifications
Multiprogramming Level (MPL)
– How many concurrent users / clients?
Heterogeneous workload?
– Is every client doing the same, or are there variations?
– Typically: a well-defined set of transactions / request kinds with a defined distribution
Do you emulate programs or users?
– If you are just interested in peak performance, issue as many requests as possible
– Sometimes a more complex user model is needed, e.g. TPC-C users with think times and sleep times (see the sketch below)
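A sketch of such a user model, reusing the Sut placeholder from the client-emulator sketch above; the keying and think times below are illustrative values only, not the TPC-C specification values:

    import java.util.Random;

    class EmulatedUser implements Runnable {
        private final Sut sut;                 // placeholder interface from the sketch above
        private final Random rnd = new Random();

        EmulatedUser(Sut sut) { this.sut = sut; }

        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Thread.sleep(500 + rnd.nextInt(1500));    // keying time (illustrative)
                    sut.execute("next transaction");
                    Thread.sleep(1000 + rnd.nextInt(4000));   // think time (illustrative)
                }
            } catch (Exception e) {
                Thread.currentThread().interrupt();           // stop on interruption or SUT error
            }
        }
    }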

11 Experimental Data
Problem: How do we get reasonable test data?
Approach 1: Tracing
– If you have an existing system available, trace 'typical' usage and use such traces to drive your experiments
Approach 2: Standard Benchmarks
– Use the data generator of a standard benchmark
– In some areas, there are explicit data corpora for evaluation
Approach 3: Make something up yourself
– Always the least preferable way!!!
– Justify why you think that your data setup is representative, e.g. by following the pattern of a standard benchmark

12 Standard Benchmarks
There are many standard benchmarks available
– very helpful to make results more comparable
– most come with synthetic data generators
– make your results more publishable (reviewers will have more trust in your experimental setup)
Disadvantages:
– Standard benchmarks can be very complex
– Some specifications are not free, but cost money
Examples:
– TPC-C, TPC-H, TPC-R, TPC-W
– ECPerf, SPECjAppServer, IBM's Trade2, etc.

13 Example: TPC Benchmarks
TPC - Transaction Processing Performance Council (tpc.org)
– Non-profit corporation of commercial software vendors
– Defines a set of database and e-business performance benchmarks
TPC-C
– Measures the performance of OLTP systems (order-entry scenario)
– V1.0 became official in 1992; current version v5.8
TPC-H and TPC-R (formerly TPC-D)
– Performance of OLAP systems (warehouse scenario)
– In 1999, TPC-D was replaced by TPC-H (ad-hoc queries) + TPC-R (reporting)
TPC-W and TPC-App
– Transactional web benchmark simulating an interactive e-business website
– TPC-W obsolete since April 2005; replaced(?) by TPC-App
TPC-E
– New OLTP benchmark that simulates the workload of a brokerage firm

14 Step 3: Implementation
Goal: Evaluation program(s) (client emulators) that measure what you have planned.
Typical elements to take care of:
– Accurate timing
– Random number generation
– Fast logging
– No hidden serialisation, e.g. via global singletons
– No screen output during the measurement interval
Avoid measuring the test harness rather than the SUT…

15 Time Measurements
Every programming language offers some timing functions, but be aware that there is a timer resolution.
– E.g. Java's System.currentTimeMillis() suggests by its name that it measures time in milliseconds… the question is how many milliseconds pass between updates (see the probe below).
There is no point in trying to measure something that takes microseconds with a timer of millisecond resolution!
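The probe mentioned above, as a small sketch: it busy-waits until the reported millisecond value changes and prints the observed step size, which approximates the timer's update granularity on that platform.

    class TimerResolution {
        public static void main(String[] args) {
            for (int i = 0; i < 5; i++) {
                long t1 = System.currentTimeMillis();
                long t2;
                // spin until the clock value changes
                do { t2 = System.currentTimeMillis(); } while (t2 == t1);
                System.out.println("observed step: " + (t2 - t1) + " ms");
            }
        }
    }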

16 Example: Java Timing
Standard (all JDKs): System.currentTimeMillis()
– Be aware: Has different resolutions depending on OS and platform!
    Linux (2.2, x86)               1 ms
    Mac OS X                       1 ms
    Windows 2000                  10 ms
    Windows 98                    60 ms
    Solaris (2.7/i386, 2.8/sun4u)  1 ms
Since JDK 1.4.2 (portable, undocumented!!): sun.misc.Perf
– Example:
    // may throw SecurityException
    sun.misc.Perf perf = sun.misc.Perf.getPerf();
    long ticksPerSecond = perf.highResFrequency();
    long currTick = perf.highResCounter();
    long milliSeconds = (currTick * 1000) / ticksPerSecond;
In JDK 1.5: java.lang.System.nanoTime()
– Always uses the best precision available on a system, but no guaranteed resolution (see the usage sketch below)
Some third-party solutions use, e.g., Windows' high-performance timers through Java JNI (hence limited portability, best for Windows)
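A minimal usage sketch for nanoTime(); only the difference between two readings is meaningful, not the absolute value:

    class NanoTiming {
        // Times a piece of work with nanosecond precision and reports milliseconds.
        static long timeMillis(Runnable work) {
            long start = System.nanoTime();
            work.run();
            long elapsedNanos = System.nanoTime() - start;
            return elapsedNanos / 1000000L;
        }
    }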

17 Example: Wrong Timer Usage
[Figure: response-time measurements taken with the millisecond-resolution timer.]

18 Example: Same Experiment with a High-Resolution Timer
– Note the 'faster' response times compared to the previous experiment
– Note: There is no chance to measure the duration of a cache hit with currentTimeMillis()

19 Random Number Generators
Common Mistakes:
– A multi-threaded client, but all threads use the same global Random object: this effectively serialises your threads!
– A large set of random numbers is generated within the measured code: we do not want to measure how fast Java can generate random numbers; use an array of pre-generated random numbers (space vs. time)
– The seeds are all the same: this makes your program deterministic…
A sketch of the first two fixes follows below.
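A sketch of the first two fixes (a per-thread Random with a distinct seed, and values pre-generated before the measurement interval); the class and field names are illustrative only:

    import java.util.Random;

    class ClientThread implements Runnable {
        private final long[] preGenerated;

        ClientThread(int threadId, int count) {
            Random rnd = new Random(System.nanoTime() + threadId);  // distinct seed per thread
            preGenerated = new long[count];
            for (int i = 0; i < count; i++) {
                preGenerated[i] = rnd.nextLong();                   // generated before measuring
            }
        }

        public void run() {
            for (long value : preGenerated) {
                // ... use 'value' to pick the next request; no RNG cost, no shared lock
            }
        }
    }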

20 Logging
Goal: Fast logging of results during experiments without interfering with the measurements.
Approach:
– Log to a file, not the screen: screen output / scrolling is VERY slow (a very common mistake)
– Use standard log libraries with low overhead, e.g. Java's log4j (http://sourceforge.net/projects/log4j/) or Windows' performance counter API
– If your client reads data from a hard drive, write your log data to a different disk
– Log asynchronously, be fast, be thread-safe (careful!) - see the sketch below
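One library-free way to get asynchronous, thread-safe file logging is sketched below (an illustrative design, not the log4j implementation): measurement threads only enqueue a line, and a single daemon thread appends it to the log file.

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class AsyncLogger {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();

        AsyncLogger(final String fileName) {
            Thread writer = new Thread(new Runnable() {
                public void run() {
                    try {
                        // autoFlush=true so lines reach the file even if the JVM exits abruptly
                        PrintWriter out = new PrintWriter(new FileWriter(fileName, true), true);
                        while (true) out.println(queue.take());   // blocks until a line arrives
                    } catch (Exception e) { /* shutdown or I/O error */ }
                }
            });
            writer.setDaemon(true);
            writer.start();
        }

        void log(String line) { queue.offer(line); }   // cheap, thread-safe, non-blocking
    }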

21 Windows Performance Monitor
Windows includes a performance monitor application
– online GUI
– can capture to file
– supports remote monitoring!
Based on the Windows Performance Counter API
– \\ComputerName\Object(Instance)\Counter
– supported by basically every server application; a huge number of statistics
– can be used in your own programs
– Cf. http://technet2.microsoft.com/WindowsServer/en/library/3fb01419-b1ab-4f52-a9f8-09d5ebeb9ef21033.mspx?mfr=true
– From Java: http://www.javaworld.com/javaworld/jw-11-2004/jw-1108-windowspm.html

22 Step 4: Evaluation
Objective: To collect accurate performance data in a set of experiments.
Three major issues:
– Controlled evaluation environment
– Documentation
– Archiving of raw data

23 Evaluation Environment
We want a stable evaluation environment that allows us to measure the system under test in repeatable settings without interference.
– Clean computer initialisation (client(s) and server(s)): no concurrent programs, a minimum set of system services, and disable anti-virus software!
– Decide: open or closed system?
– Decide: cold or warm caches?
– Make sure you do not measure any side-effects!
– Many prefer to measure during the night - why?

24 Open vs. Closed Systems
Open system: clients arrive and leave the system (with an appropriate arrival distribution)
Closed system: a fixed number of clients; each starts a new task after finishing the previous one
– An open system is generally more realistic
– A closed system is much easier to write as a test harness
– An open system behaves much worse when contention is high
See Schroeder et al., Proc. NSDI'06. A sketch of both load models follows below.
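A sketch contrasting the two load models (again using the assumed Sut placeholder from the client-emulator sketch): the closed system runs a fixed number of client loops, while the open system submits requests at exponentially distributed inter-arrival times, independent of whether earlier requests have finished.

    import java.util.Random;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class LoadModels {
        // Closed system: MPL threads, each issuing the next request only after the previous one.
        static void closedSystem(final Sut sut, int mpl, final int requestsPerClient) {
            for (int c = 0; c < mpl; c++) {
                new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < requestsPerClient; i++) {
                            try { sut.execute("request"); } catch (Exception ignored) { }
                        }
                    }
                }).start();
            }
        }

        // Open system: Poisson arrivals with the given rate (requests per second).
        static void openSystem(final Sut sut, double arrivalRate, int totalRequests)
                throws InterruptedException {
            ExecutorService pool = Executors.newCachedThreadPool();
            Random rnd = new Random();
            for (int i = 0; i < totalRequests; i++) {
                long waitMillis = (long) (-Math.log(1.0 - rnd.nextDouble()) / arrivalRate * 1000);
                Thread.sleep(waitMillis);                  // exponential inter-arrival time
                pool.submit(new Runnable() {
                    public void run() {
                        try { sut.execute("request"); } catch (Exception ignored) { }
                    }
                });
            }
            pool.shutdown();
        }
    }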

25 Sequential Evaluation
The special case of MPL = 1
– There are no concurrent clients, or the system is centralised
The test harness is typically the same; many of the implementation details still apply.
Several experiment repetitions
– in order to determine stable statistics, e.g. the mean response time
– Note: Use the same number of repetitions for all experiments!

26 Multiprogramming Level > 1
A parallel evaluation (with fixed MPL) has three distinct phases: ramp-up, steady phase, and close-down. Only measure during the steady phase!
Performance can be measured either
– periodically during the steady phase, or
– in summary at the end (measuring period = whole steady phase); then the experiment must be repeated several times to obtain a mean value.
[Figure: performance over time, showing the ramp-up, steady, and close-down phases.]
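A sketch of steady-phase filtering, under the assumption that every measurement is recorded together with its timestamp; samples from the ramp-up and close-down phases are simply discarded before any statistics are computed.

    import java.util.ArrayList;
    import java.util.List;

    class SteadyPhase {
        // each sample: {timestampMillis, responseTimeMillis}
        static List<Long> steadyOnly(List<long[]> samples, long startMillis,
                                     long rampUpMillis, long closeDownMillis, long endMillis) {
            long steadyStart = startMillis + rampUpMillis;
            long steadyEnd = endMillis - closeDownMillis;
            List<Long> kept = new ArrayList<Long>();
            for (long[] s : samples) {
                if (s[0] >= steadyStart && s[0] <= steadyEnd) {
                    kept.add(s[1]);   // keep only measurements taken in the steady phase
                }
            }
            return kept;
        }
    }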

27 Performance Behaviour over MPL
The measurements from the previous slide give you a single measuring point for one fixed MPL.
Typically one is interested in the system behaviour at varying MPLs
– One needs to conduct separate experiments for each MPL
What should we expect if we plot throughput over the MPL?
[Figure: two alternative throughput-over-MPL curves - "This? Or this?"]

28 Outlier Policy
Inherent complexity of the evaluated systems
– There is always a 'noise signal' or 'uncertainty factor'
We expect individual measurements to vary around a stable mean value
– Note that the standard deviation can be quite high.
– But some measurements are 'way off' and have to be dealt with.
Outlier policy (see the sketch below):
– What is an outlier? E.g. more than n times the standard deviation off the mean.
– How to deal with it? E.g. replace it with a 'spare' value.
– Trace outliers - if there are too many, are they still outliers?
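A sketch of one such policy: flag everything more than k sample standard deviations from the mean and report how many measurements were flagged. The threshold k and what to do with flagged values remain your documented decision.

    class OutlierFilter {
        // Assumes at least two measurements; returns a flag per measurement.
        static boolean[] flagOutliers(double[] values, double k) {
            double mean = 0;
            for (double v : values) mean += v;
            mean /= values.length;
            double var = 0;
            for (double v : values) var += (v - mean) * (v - mean);
            double stddev = Math.sqrt(var / (values.length - 1));   // sample standard deviation
            boolean[] outlier = new boolean[values.length];
            int count = 0;
            for (int i = 0; i < values.length; i++) {
                outlier[i] = Math.abs(values[i] - mean) > k * stddev;
                if (outlier[i]) count++;
            }
            System.out.println(count + " of " + values.length + " measurements flagged as outliers");
            return outlier;
        }
    }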

29 Evaluation Scripting
Conducting evaluations is tedious and error-prone
– E.g. to test scalability in the example paper: 3 scenarios x 4 algorithms = 12 configurations; 12 configurations x 10 MPLs x 21 measurements = 2520 runs
– Note: Implementing the benchmark itself is only one part; those 2520 runs still have to be set up and executed.
Need a 'test harness' - a set of scripts/programs that
– automates the evaluation as much as possible, and
– makes the evaluation repeatable: start/stop servers, copy logging and configuration files, check for hanging runs, etc.
– Write scripts in the language that you prefer (Python, shell scripts, …); good knowledge of the OS and shell scripting is helpful

30 Example Evaluation Script
    REM @Echo OFF
    TITLE QueryTest
    REM parameters:
    REM %1 number of nodes
    REM … (this script has a bunch of parameters)
    REM -----------------------------------------------------------------------
    REM freshly start server
    Z:/programs/gnu/bin/sleep 10
    start ./server.exe
    Z:/programs/gnu/bin/sleep 20
    REM configure server and prepare cache
    ./pdbsetconfig Nodes=%2 MaxQueueSize=22 MaxLoad=1 WrapperType=%3 Nodes=%2 Routing=RoundRobin Password=%4 Username=%4 ScanInputQueue=FCFS
    ./client Verbose=0 Directory=P:/tpc-r/%3/pdb_queries Files=CacheFlush.sql Loops=%1
    REM run the actual measurement
    REM the result logging is implemented as part of the client program
    ./pdbstatistics RESET
    IF %10==verbose ./pdbgetconfig
    IF %10==verbose echo === Testing %5-Routing (maxload %6 history %7) with %8/%9ms
    IF %10==verbose ./client Files=%8 Verbose=1 Directory=P:/tpc-r/%3/pdb_queries Timeout=%9 Randomize=True ServerStats=True
    IF NOT %10==verbose ./client Files=%8 Verbose=0 Directory=P:/tpc-r/%3/pdb_queries Timeout=%9 Randomize=True ServerStats=True ExcelOutput="%1%6%7%8"
    REM stop server again
    REM first try normally, but if it hangs - kill it.
    ./pdbsetconfig Shutdown
    Z:/programs/gnu/bin/sleep 20
    Z:/programs/reskit4/kill -F server.exe
[Slide annotations: Archiving of results is part of the client; the directory structure is parameterised in the script. The 'echo === Testing …' line logs the current configuration and which test it is.]

31 Evaluation Documentation
Goal: Being able to verify your results, e.g. by re-running your experiments.
Keep detailed documentation of what you are doing
– Full disclosure of the evaluation environment, including hardware, OS, and software with full version numbers and any patches/changes
– Write a 'Readme': how you set up and conduct the tests. Chances are high that you will want to re-run something later, but even a few days(!) afterwards you will not remember every detail anymore… not to mention follow-up projects.
– Keep an evaluation logbook: when and what was evaluated, etc.

32 Evaluation: Archiving of Results
Goal: To be able to analyse the results later in all detail, even with regard to aspects that you did not think about when you planned the evaluation.
Archive ALL RAW DATA of your results
– not just average values etc.
– include any server logfiles, error files, and configurations
Best practice:
– Keep a directory structure that corresponds to your evaluations
– Collect all raw result files, logfiles, config files etc. for each individual experiment; include the environment description

33 Example Result Archive Structure
Evaluation
– Cluster Mix 1
    Evaluation Setup.Readme
    Cluster Mix 1.xls
    Serial locking
        2006-8-12_run1
            Client.log
            Server.log
        2006-8-13_run2
            …
    Object-level locking
    Field-level locking
    Semantic Locking
– Cluster Mix 2
    …
– Semantic Mix
    …
I prefer to have separate sheets in the Excel file for all raw results, and then one aggregation sheet for the means, which are plotted.

34 Archiving: Common Mistakes
No server logs
– How do you verify that some effect wasn't due to a fault?
No configuration / environment description
– Was this result obtained before or after you changed some setting?
No raw data values, just aggregates
[Example table with columns Updates, Reads, Staleness, Drift, Read ratio, Mode, MRT, Executions, Aborts, annotated with the questions it leaves open: Is MRT in minutes, seconds, or millis? How many runs? Where is the standard deviation? What is the meaning of this column? One column contradicts the first two.]

35 Step 5: Result Analysis
Raw performance data is analysed with statistical methods
– Cf. last week's lecture
Important:
– Standard deviation
– Confidence intervals
– Include error bars in graphs
A minimal sketch of these basic statistics follows below.
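The sketch referred to above computes the mean, the sample standard deviation, and a normal-approximation 95% confidence interval for the mean; for small numbers of repetitions a t-quantile should replace the 1.96 factor.

    class Stats {
        static void report(double[] x) {
            int n = x.length;                    // assumes n >= 2 repetitions
            double mean = 0;
            for (double v : x) mean += v;
            mean /= n;
            double var = 0;
            for (double v : x) var += (v - mean) * (v - mean);
            double stddev = Math.sqrt(var / (n - 1));
            double halfWidth = 1.96 * stddev / Math.sqrt(n);   // 95% CI half-width
            System.out.printf("mean=%.2f  stddev=%.2f  95%% CI=[%.2f, %.2f]%n",
                    mean, stddev, mean - halfWidth, mean + halfWidth);
        }
    }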

36 Step 6: Result Presentation
Finally, some remarks on good visualisation of results.
Note: The following examples are copied from conference submissions for illustration purposes only, not to blame the authors…

37 Example 1: Wrong Origin
Which one is better, and by how much - LEACH or DEEAC?
– The Y-axis does not start at 0! This misrepresents the actual performance difference, and the two graphs side-by-side are not comparable!
– Also note: What does the Y-axis show at all?

38 Example 2: Wrong Scaling
How would you describe the scalability of algorithm 'cc'?
– A standard problem when using Excel the 'easy way'...

39 Example 3: ?
Which approach is better?
– Ok, this is unfair, as the authors used this graph explicitly to show that all approaches behave the same…

40 Presentation of the Results
Be consistent with naming, colours, and order
– Two graphs are very hard to compare if you use different colours or names for the same things
Axes start at 0; give descriptive axis titles with unit names
Show error bars
– If it becomes too messy, show just one and explain it in the text
Excel and non-linear intervals == MESS
– e.g. gnuplot is much better
Make sure everything is readable
– use large fonts, thick curves (not just 1 point wide) and dark colours
Everything in the graphs should be explained

41 Links and References
Performance Evaluations
– Jim Gray. "The Benchmark Handbook: For Database and Transaction Processing Systems". Morgan Kaufmann, 1992.
– G. Haring, C. Lindemann and M. Reiser (eds.). "Performance Evaluation: Origins and Directions". LNCS 1769, 2000.
– B. Schroeder et al. "Open versus Closed: A Cautionary Tale". In Proceedings of USENIX NSDI'06, pp. 239-252, 2006.
– P. Wu, A. Fekete and U. Roehm. "The Efficacy of Commutativity-Based Semantic Locking in Real-World Applications". 2006.
– U. Roehm. "OLAP with a Cluster of Databases". DISBIS 80, 2002.
Java Time Measurements
– http://www.jsresources.org/faq_performance.html
– V. Roubtsov. "My kingdom for a good timer!", January 2003. http://www.javaworld.com/javaworld/javaqa/2003-01/01-qa-0110-timing.html
Java vs. C etc.
– http://www.idiom.com/~zilla/Computer/javaCbenchmark.html


43 Evaluation Framework
Few open-source tools for performance testing are available.
We are currently working on a Java framework
– Goal: to separate the test driver from the application; multi-threading, configuration, timing, logging… are always the same across different evaluations (see the sketch below)
– Based on some .NET code used by colleagues at CSIRO
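The kind of separation meant here could look like the following sketch; all names are illustrative and not the actual framework's API. The driver owns threading, timing, and logging, while each evaluation only implements a small workload interface.

    import java.util.Properties;

    // One evaluation = one implementation of this interface; the generic driver
    // handles MPL, measurement intervals, timing, and result logging around it.
    interface Workload {
        void setUp(Properties config) throws Exception;       // connect, load data, ...
        void executeRequest(int clientId) throws Exception;   // one timed unit of work
        void tearDown() throws Exception;                     // disconnect, clean up
    }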

