Rigorous Benchmarking in Reasonable Time
Tomas Kalibera and Richard Jones, University of Kent
ISMM, Seattle, June 2013

What do we want to establish?
By comparing an old and a new system rigorously, we want to know:
- Is there a performance change?
- How large is the change?
- What variation should we expect?
- How confident are we in the result?
- How many experiments must we carry out?
[Figure: the ratio of new execution time to old execution time.]

Uncertainty
- Computer systems are complex.
- Many factors influence performance: some known, some out of the experimenter's control, some non-deterministic.
- Execution times vary.
- We need to design experiments and summarise results in a repeatable and reproducible fashion.

Uncertainty should be reported!
[Chart: papers published per year, highlighting the proportion that ignored uncertainty.]

How were the experiments performed?
- It is not always obvious whether experiments were repeated.
- Very few papers report that experiments repeat at more than one level, e.g.:
  - repeated executions (e.g. invocations of a JVM);
  - repeated measurements (e.g. iterations of an application).
- Number of repetitions: arbitrary or heuristic-based?

Good experimental methods take time
[Build-up charts: the time cost of one benchmark, then a whole suite, then adding repeated invocations, repeated iterations, and multiple heap sizes.]

A lost cause?
- Is statistically rigorous experimental methodology simply infeasible?

NO!
- With some initial one-off investment, we can cater for variation without excessive repetition (in most cases).
- Our contributions:
  - a sound experimental methodology that makes the best use of experiment time;
  - how to establish how much repetition is needed;
  - how to estimate error bounds.

The Challenge of Reasonable Repetition
- Variation arises at several stages of a benchmark experiment: iteration, execution, compilation…
- Controlled variables: platform, heap size or compiler options.
- Random variables: characterised by their statistical properties.
- Uncontrolled variables: try to convert these to controlled or randomised variables (e.g. by randomising link order).
- The challenge: design efficient experiments given the random variables present, and summarise the results with a confidence interval.

Our running example
An experiment with three "levels" (though our technique is general):
1. Repeat compilation to create a binary, e.g. if code performance depends on layout.
2. Repeat executions of the same binary.
3. Repeat iterations of a benchmark.

The Challenge of Summarising Results
- Significance testing vs. visual tests vs. effect-size confidence intervals.

Independent state
- Researchers are typically interested in steady-state performance.
- Initialised state: no significant initialisation overhead remains.
- Independent state: iteration times are (statistically) independent and identically distributed (IID).
- Don't repeat measurements before independence: if measurements are not IID, the variance and confidence interval estimates will be biased.

Independent state
- Does a benchmark reach an independent state? After how many iterations?
- DaCapo on OpenJDK 7, 'large' and 'small' sizes; 3 executions, 300 iterations per execution.
- Inspect run-sequence, lag and auto-correlation plots for patterns indicating dependence.
RECOMMENDATION: Use this manual procedure just once to find how many iterations each benchmark, VM and platform combination requires to reach an independent state.
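The slides rely on visual inspection of the plots; as a numerical complement, here is a minimal sketch of a sample-autocorrelation screen. The looks_independent helper and its 0.2 threshold are illustrative assumptions, not part of the paper's method:

```python
import numpy as np

def autocorrelation(times, lag):
    """Sample autocorrelation of a sequence of iteration times at a lag."""
    x = np.asarray(times, dtype=float) - np.mean(times)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def looks_independent(times, max_lag=10, threshold=0.2):
    """Crude screen: no large autocorrelation at small lags.
    This complements, not replaces, run-sequence/lag/ACF plots."""
    return all(abs(autocorrelation(times, k)) < threshold
               for k in range(1, max_lag + 1))

# A synthetic run: 50 warm-up iterations drifting down, then steady state.
rng = np.random.default_rng(42)
times = np.concatenate([np.linspace(2.0, 1.0, 50),
                        rng.normal(1.0, 0.01, 250)])
print(looks_independent(times))       # expect False: drift => correlation
print(looks_independent(times[50:]))  # expect True once warm-up is dropped
```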

Reached independent state?
[Per-benchmark plots (avrora, bloat, chart, eclipse, fop, h2, hsqldb, jython, luindex, lusearch, pmd, sunflow, tomcat, tradebeans, tradesoap, xalan) showing which benchmarks reach an independent state, for DaCapo 'small' and 'large' on an Intel Xeon (2 processors x 4 cores x 2-way HT) and an AMD Opteron (4 processors x 16 cores).]

Some benchmarks don't reach an independent state
- Many benchmarks do not reach an independent state in a reasonable time.
- Most have strong auto-dependencies: gradual drift in times and trends (increases and decreases), abrupt state changes, systematic transitions.
- The choice of iteration significantly influences a result.
- This is problematic for online algorithms, which distinguish small differences although the noise is many times larger.
- Fortunately, trends tend to be consistent across runs.
RECOMMENDATION: If a benchmark does not reach an independent state in a reasonable time, take the same iteration from each run.

Heuristics don't do well
[Table: per-benchmark iteration counts needed to reach the initialised and independent states, compared with the counts chosen by the DaCapo harness heuristic and the Georges et al. heuristic, for bloat, chart, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd and xalan.]
- The heuristics can waste time (far more iterations than needed).
- A heuristic that never converges (∞) is unusable.
- All benchmarks are initialised in reasonable time.

What to repeat?
- Option 1: run a benchmark to independence and then repeat a number of iterations, collecting each result; or
- Option 2: repeatedly run a benchmark until it is initialised and then collect a single result.
- The first method saves experimentation time if:
  - variation between iterations > variation between executions,
  - initialisation warm-up + VM initialisation is large, and
  - independence warm-up is small.
[Chart: variation (%) between iterations vs. between executions for bloat, eclipse, lusearch and xalan on an AMD Opteron (4 processors x 16 cores).]

A clear but rigorous account
- Goal: quantify a performance optimisation in the form of an effect-size confidence interval, e.g. "we are 95% confident that system A is faster than system B by 5.5% ± 2.5%".
- We need to repeat executions and take multiple measurements from each.
- For a given experimental budget, we want to obtain the tightest possible confidence interval.
- Adding repetition at the highest level always increases precision, but it is often cheaper to add repetitions at lower levels.

Multi-level repetition
How many repetitions to do at which levels?
1. Run an initial, dimensioning experiment:
   - gather the cost of a repetition at each level (iteration: time to complete an iteration; execution: more expensive, since we must reach an independent state);
   - calculate optimal repetition counts for the real experiment.
2. Run the real experiment:
   - use the optimal repetition counts from the initial experiment;
   - calculate the effect-size confidence interval.

Initial experiment
- Choose arbitrary repetition counts r_1, …, r_n: 20 may be enough, 30 if possible, 10 if you must (e.g. if there are many levels).
- Then measure the cost of each level, e.g.:
  - c_1: time to get an iteration (iteration duration);
  - c_2: time to get an execution (time to reach an independent state);
  - c_3: time to get a binary (build time).
- Also take the measurement times Y_{j_n, …, j_1}, e.g. Y_{2,1,3} is the time of the 3rd non-warm-up iteration from the 1st execution of the 2nd binary.
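To make the indexing concrete, here is a minimal sketch of holding the nested measurements in an array. The shape, the stand-in random data and the variable names are illustrative assumptions:

```python
import numpy as np

r3, r2, r1 = 5, 5, 20      # binaries, executions per binary, iterations
rng = np.random.default_rng(0)
Y = rng.normal(1.0, 0.05, size=(r3, r2, r1))  # stand-in measurement times

# The slide's Y_{2,1,3} (1-based) is the 3rd non-warm-up iteration of the
# 1st execution of the 2nd binary; with 0-based numpy indexing:
y_2_1_3 = Y[1, 0, 2]
```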

Variance estimators (initial experiment)
- First calculate the n biased estimators $S_1^2, \ldots, S_n^2$, where $S_i^2$ is the pooled sample variance of the level-$i$ means within their enclosing groups:
  $S_i^2 = \frac{1}{\left(\prod_{k=i+1}^{n} r_k\right)(r_i - 1)} \sum_{j_n=1}^{r_n} \cdots \sum_{j_i=1}^{r_i} \left(\bar{Y}_{j_n \ldots j_i} - \bar{Y}_{j_n \ldots j_{i+1}}\right)^2$,
  where $\bar{Y}_{j_n \ldots j_i}$ averages over all omitted lower-level indices.
- Then compute the unbiased estimators $T_i^2$ iteratively: $T_1^2 = S_1^2$, and $T_i^2 = S_i^2 - S_{i-1}^2 / r_{i-1}$ for $i \ge 2$.
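A sketch of both estimator families for such a fully nested array, following the formulas above. The array layout matches the earlier sketch; clamping negative T_i^2 estimates to zero is a practical safeguard I have added, not something the slide mandates:

```python
import numpy as np

def biased_estimators(Y):
    """S_1^2 ... S_n^2 for Y with shape (r_n, ..., r_1): at each level,
    the pooled sample variance of the level means within their parents."""
    S = []
    A = np.asarray(Y, dtype=float)
    while A.ndim > 0:
        means = A.mean(axis=-1)                 # means one level up
        dev = A - means[..., np.newaxis]        # deviations within parents
        groups = int(np.prod(A.shape[:-1]))     # number of parent groups
        S.append(float((dev ** 2).sum()) / (groups * (A.shape[-1] - 1)))
        A = means
    return S

def unbiased_estimators(S, r):
    """T_1^2 = S_1^2 and T_i^2 = S_i^2 - S_{i-1}^2 / r_{i-1};
    r = [r_1, ..., r_n]. Negative estimates are clamped to zero."""
    T = [S[0]]
    for i in range(1, len(S)):
        T.append(max(0.0, S[i] - S[i - 1] / r[i - 1]))
    return T

# With Y, r1, r2, r3 from the previous sketch (r = [r_1, r_2, r_3]):
S = biased_estimators(Y)
T2 = unbiased_estimators(S, [r1, r2, r3])
```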

Optimal repetition counts (real experiment)
- The optimal repetition counts to be used in the real experiment are r_1, …, r_{n-1}.
- We don't calculate r_n, the repetition count for the highest level: r_n can always be increased for more precision.
- Calculate the variance estimator S_n^2 for the real experiment as before, but using the optimal repetition counts r_1, …, r_{n-1} and the measurements from the real experiment.
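The slide does not reproduce the closed form. The sketch below follows the cost/variance balancing rule $r_i = \sqrt{(c_{i+1} T_i^2) / (c_i T_{i+1}^2)}$ as I understand it from the authors' analysis; treat the exact formula, and the rounding policy, as assumptions to check against the technical report:

```python
import math

def optimal_counts(c, T2):
    """Optimal r_1 .. r_{n-1} from per-level costs c = [c_1, ..., c_n] and
    unbiased variance estimators T2 = [T_1^2, ..., T_n^2].
    Rounds up and never goes below one repetition; assumes T2[i+1] > 0."""
    return [max(1, math.ceil(math.sqrt((c[i + 1] * T2[i])
                                       / (c[i] * T2[i + 1]))))
            for i in range(len(c) - 1)]
```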

Confidence intervals (real experiment)
- Asymptotic confidence interval with confidence $(1 - \alpha)$:
  $\bar{Y} \pm t_{1-\alpha/2,\,\nu} \sqrt{S_n^2 / r_n}$,
  where $t_{1-\alpha/2,\,\nu}$ is the $(1-\alpha/2)$-quantile of the t-distribution with $\nu = r_n - 1$ degrees of freedom.
- See the ISMM'13 paper for details of constructing confidence intervals of execution time ratios.
- See our technical report for proofs and gory details.
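Putting the interval together, reusing biased_estimators from the sketch above and SciPy's t-quantile (the helper name is mine):

```python
import math
from scipy import stats

def mean_with_half_width(Y, alpha=0.05):
    """Grand mean and half-width t_{1-alpha/2, r_n - 1} * sqrt(S_n^2 / r_n)."""
    r_n = Y.shape[0]                     # top-level repetition count
    S_n2 = biased_estimators(Y)[-1]      # variance of the top-level means
    h = stats.t.ppf(1 - alpha / 2, r_n - 1) * math.sqrt(S_n2 / r_n)
    return float(Y.mean()), h
```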

Confidence interval for execution time ratios
- Confidence interval due to Fieller (1954):
  $\frac{\bar{Y}\,\bar{Y}' \pm \sqrt{\bar{Y}^2 h'^2 + \bar{Y}'^2 h^2 - h^2 h'^2}}{\bar{Y}'^2 - h'^2}$
- $\bar{Y}$ and $\bar{Y}'$ are the average execution times from the old and new systems.
- Variance estimators $S_n^2$ and $S'^2_n$ and half-widths $h$, $h'$ as before.
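A direct transcription of the interval above into code; the equation itself was an image on the slide, so this reconstructs the standard Fieller construction. Argument names are mine, and the function assumes independent means with mean_new^2 > h_new^2 (positive denominator):

```python
import math

def fieller_ratio_ci(mean_old, h_old, mean_new, h_new):
    """CI for mean_old / mean_new from two independent means and their
    confidence-interval half-widths."""
    disc = (mean_old ** 2 * h_new ** 2 + mean_new ** 2 * h_old ** 2
            - h_old ** 2 * h_new ** 2)
    centre = mean_old * mean_new
    denom = mean_new ** 2 - h_new ** 2
    return ((centre - math.sqrt(disc)) / denom,
            (centre + math.sqrt(disc)) / denom)
```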

In practice
- For each benchmark/VM/platform combination, conduct a dimensioning experiment to establish the optimal repetition counts for every level but the top level of the real experiment.
- Redimension only if the benchmark, VM or platform changes.

DaCapo (revisited)
- The confidence half-intervals using optimal repetition counts correspond closely to those obtained by running large numbers of executions (30) and iterations (40), but the repetition counts are much lower.
- E.g. lusearch: r_1 = 1, so time is better spent repeating executions.
[Table: per-benchmark costs c_1 and c_2 (s), optimal r_1, and confidence half-intervals (%) for the optimal and original repetition counts, for bloat, lusearch and xalan on an AMD Opteron (4 processors x 16 cores).]

Conclusions
- Researchers should provide measures of variation when reporting results.
- DaCapo and SPEC CPU benchmarks need very different repetition counts on different platforms before they reach an initialised or independent state.
- Iteration execution times are often strongly auto-dependent; for these, automatic detection of steady state is not applicable: it can waste time or mislead.
- A one-off (per benchmark/VM/platform) dimensioning experiment can provide the optimal counts for repetition at each level of the real experiments.

RECOMMENDATION: Benchmark developers should include our dimensioning methodology as a one-off, per-system configuration requirement.


Code layout experiments

What's of interest?
- Mean execution times.
- A minimum threshold for the ratio of execution times: we are only interested in 'significant' performance changes, and improvements in systems research are often small, e.g. 10%.
- Many factors influence performance, e.g. memory placement, randomised compilation algorithms, the JIT compiler, symbol names… [Mytkowicz et al., ASPLOS 2009; Gu et al., Component and Middleware Performance Workshop 2004].
- Randomisation to avoid measurement bias, e.g. the Stabilizer tool [Curtsinger & Berger, UMass TR, 2012].

Current best practice
- Based on two-level hierarchical experiments: repeat measurements until the standard deviation of the last few measurements is small enough.
- Quantify changes using a visual or statistical significance test [Georges et al., OOPSLA 2007; PhD 2008].
- Problems:
  - two levels are not always appropriate;
  - null hypothesis significance tests are deprecated in other sciences;
  - visual tests are overly conservative.

Null hypothesis significance tests
- Null hypothesis: "the two systems have the same performance".
- A test asks whether the null hypothesis can be rejected: "it is unlikely that the systems have the same performance".
- Student's t-test.
- Visual test.

Visual test
- Construct confidence intervals. Do they overlap?
- If not, it is unlikely that the systems have the same performance.
- If there is only slight overlap (the centre of neither interval is covered by the other), fall back to a statistical test.
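The slide's decision procedure as a small sketch; the helper name and the outcome strings are mine:

```python
def visual_test(mean_a, h_a, mean_b, h_b):
    """Overlap check on two confidence intervals (means and half-widths)."""
    lo_a, hi_a = mean_a - h_a, mean_a + h_a
    lo_b, hi_b = mean_b - h_b, mean_b + h_b
    if hi_a < lo_b or hi_b < lo_a:
        return "no overlap: a performance difference is likely"
    if lo_b <= mean_a <= hi_b or lo_a <= mean_b <= hi_a:
        return "centre covered: no evidence of a difference"
    return "slight overlap: fall back to a statistical test"
```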

What's wrong with this?
1. It does not tell us what we want to know: only whether there is a performance change. We could also report the ratio of sample means, but we still would not know how much of the change is due to uncertainty.
2. The decision is affected by sample size: the larger the sample, the more unlikely even a small and meaningless change becomes.
- Its limitations have been known for 70 years.
- It is deprecated in many fields: statistics, psychology, medicine, biology, chemistry, sociology, education, ecology…

What's wrong with this (cont.)?
3. Both tests use parametric methods that violate their assumptions.
- Performance measurements are not usually normally distributed: they are multi-modal, with long tails to the right.
- It is good practice to check whether the data is close to normal; robust methods are used in some fields.
- We should at least make the assumptions clear: that using Student's t-test is OK… and often it is OK.

Two methods
- A statistical model of random effects in an n-way classification.
- Use this model to construct an effect-size confidence interval for the ratio of the means of execution time:
  1. a parametric method based on asymptotic normality;
  2. a non-parametric method based on statistical simulation (the 'bootstrap').

Quantifying the performance (1)
- Parametric method.
- Use the same number of repetitions for the old (Y^O) and new (Y^N) systems.
- Report a (1 − α) confidence interval (e.g. α = 0.05 for a 95% CI).
- t_{α/2, ν} denotes the α/2-quantile of the t-distribution; as on the earlier confidence-interval slide, ν is one less than the top-level repetition count.

Quantifying performance (2)
- Bootstrap method:
  1. Perform many simulations (1000 or more if there is time), using the real data within each simulated step.
  2. Randomly choose the values to use at each level; resampling with replacement at all levels seems safe.
  3. Calculate many sample means from these; they are asymptotically normal due to the Central Limit Theorem.
- Form a (1 − α) CI using the α/2 and 1 − α/2 sample quantiles, e.g. order the values and use the 25th and 975th of 1000.
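A minimal sketch of this procedure for the ratio of means, assuming two-level data (executions x iterations) and resampling with replacement at both levels, as the slide suggests is safe; the helper names are mine:

```python
import numpy as np

def bootstrap_ratio_ci(Y_old, Y_new, sims=1000, alpha=0.05, seed=0):
    """(1-alpha) bootstrap CI for mean(Y_old) / mean(Y_new); each Y has
    shape (executions, iterations)."""
    rng = np.random.default_rng(seed)

    def resampled_mean(Y):
        e, i = Y.shape
        rows = Y[rng.integers(0, e, size=e)]        # resample executions
        cols = rng.integers(0, i, size=(e, i))      # resample iterations
        return np.take_along_axis(rows, cols, axis=1).mean()

    ratios = np.array([resampled_mean(Y_old) / resampled_mean(Y_new)
                       for _ in range(sims)])
    lo, hi = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```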

Parametric vs. bootstrap
- The bootstrap is more robust than the parametric method:
  - it uses fewer assumptions and does not depend on the underlying distribution;
  - there is no need to check whether the data is reasonably close to normal;
  - it can be used with other metrics, e.g. medians.
- The parametric method is more confident: narrower confidence intervals, so more likely to find a significant difference.