Empirical Methods for AI & CS Paul Cohen Ian P. Gent Toby Walsh
2 Overview
- Introduction
  - What are empirical methods?
  - Why use them?
- Case Study
  - Eight Basic Lessons
- Experiment design
- Data analysis
- How not to do it
- Supplementary material
3 Resources
- Web
- Books
  - Empirical Methods for AI, Paul Cohen, MIT Press, 1995
- Journals
  - Journal of Experimental Algorithmics
- Conferences
  - Workshop on Empirical Methods in AI (last Saturday, ECAI-02?)
  - Workshop on Algorithm Engineering and Experiments, ALENEX 01 (alongside SODA)
Empirical Methods for CS Part I: Introduction
5 What does empirical mean?
- Relying on observations, data, experiments
- Empirical work should complement theoretical work
  - Theories often have holes (e.g., How big is the constant term? Is the current problem a bad one?)
  - Theories are suggested by observations
  - Theories are tested by observations
  - Conversely, theories direct our empirical attention
- In addition (in this tutorial at least) empirical means wanting to understand behavior of complex systems
6 Why We Need Empirical Methods
Cohen, 1990: Survey of 150 AAAI Papers
- Roughly 60% of the papers gave no evidence that the work they described had been tried on more than a single example problem.
- Roughly 80% of the papers made no attempt to explain performance, to tell us why it was good or bad and under which conditions it might be better or worse.
- Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis.
- Theory papers generally had no applications or empirical work to support them; empirical papers were demonstrations, not experiments, and had no underlying theoretical support.
- The essential synergy between theory and empirical work was missing
7 Theory, not Theorems
- Theory based science need not be all theorems
  - otherwise science would be mathematics
- Consider theory of QED
  - based on a model of behaviour of particles
  - predictions accurate to many decimal places (9?)
    - most accurate theory in the whole of science?
  - success derived from accuracy of predictions
    - not the depth or difficulty or beauty of theorems
  - QED is an empirical theory!
8 Empirical CS/AI
- Computer programs are formal objects
  - so let's reason about them entirely formally?
- Two reasons why we can't or won't:
  - theorems are hard
  - some questions are empirical in nature
    - e.g. are Horn clauses adequate to represent the sort of knowledge met in practice?
    - e.g. even though our problem is intractable in general, are the instances met in practice easy to solve?
9 Empirical CS/AI
- Treat computer programs as natural objects
  - like fundamental particles, chemicals, living organisms
- Build (approximate) theories about them
  - construct hypotheses
    - e.g. greedy hill-climbing is important to GSAT
  - test with empirical experiments
    - e.g. compare GSAT with other types of hill-climbing
  - refine hypotheses and modelling assumptions
    - e.g. greediness not important, but hill-climbing is!
10 Empirical CS/AI
- Many advantages over other sciences
- Cost
  - no need for expensive super-colliders
- Control
  - unlike the real world, we often have complete command of the experiment
- Reproducibility
  - in theory, computers are entirely deterministic
- Ethics
  - no ethics panels needed before you run experiments
11 Types of hypothesis
- "My search program is better than yours"
  - not very helpful: a beauty competition?
- "Search cost grows exponentially with number of variables for this kind of problem"
  - better, as we can extrapolate to data not yet seen?
- "Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems"
  - even better, as we can extrapolate to new situations?
12 A typical conference conversation
- "What are you up to these days?"
- "I'm running an experiment to compare the Davis-Putnam algorithm with GSAT."
- "Why?"
- "I want to know which is faster."
- "Why?"
- "Lots of people use each of these algorithms."
- "How will these people use your result?"...
13 Keep in mind the BIG picture
- "What are you up to these days?"
- "I'm running an experiment to compare the Davis-Putnam algorithm with GSAT."
- "Why?"
- "I have this hypothesis that neither will dominate."
- "What use is this?"
- "A portfolio containing both algorithms will be more robust than either algorithm on its own."
14 Keep in mind the BIG picture...
- "Why are you doing this?"
- "Because many real problems are intractable in theory but need to be solved in practice."
- "How does your experiment help?"
- "It helps us understand the difference between average and worst case results."
- "So why is this interesting?"
- "Intractability is one of the BIG open questions in CS!"
15 Why is empirical CS/AI in vogue?
- Inadequacies of theoretical analysis
  - problems often aren't as hard in practice as theory predicts in the worst case
  - average-case analysis is very hard (and often based on questionable assumptions)
- Some spectacular successes
  - phase transition behaviour
  - local search methods
  - theory lagging behind algorithm design
16 Why is empirical CS/AI in vogue?
- Compute power ever increasing
  - even intractable problems coming into range
  - easy to perform large (and sometimes meaningful) experiments
- Empirical CS/AI perceived to be easier than theoretical CS/AI
  - often a false perception, as experiments are easier to mess up than proofs
Empirical Methods for CS Part II: A Case Study Eight Basic Lessons
18 Rosenberg study
- "An Empirical Study of Dynamic Scheduling on Rings of Processors", Gregory, Gao, Rosenberg & Cohen, Proc. of 8th IEEE Symp. on Parallel & Distributed Processing, 1996
- Linked to from
19 Problem domain
- Scheduling processors on ring network
  - jobs spawned as binary trees
- KOSO
  - keep one, send one to my left or right arbitrarily
- KOSO*
  - keep one, send one to my least heavily loaded neighbour
20 Theory
- On complete binary trees, KOSO is asymptotically optimal
- So KOSO* can't be any better?
- But assumptions unrealistic
  - tree not complete
  - asymptotically not necessarily the same as in practice!
- Thm: Using KOSO on a ring of p processors, a binary tree of height n is executed within (2^n - 1)/p + low order terms
21 Benefits of an empirical study
- More realistic trees
  - probabilistic generator that makes shallow trees, which are bushy near root but quickly get scrawny
  - similar to trees generated when performing Trapezoid or Simpson's Rule calculations
    - binary trees correspond to interval bisection
- Startup costs
  - network must be loaded
22 Lesson 1: Evaluation begins with claims. Lesson 2: Demonstration is good, understanding better
- Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better
  - The "because" phrase indicates a hypothesis about why it works. This is a better hypothesis than the beauty contest demonstration that KOSO* beats KOSO
- Experiment design
  - Independent variables: KOSO v KOSO*, no. of processors, no. of jobs, probability(job will spawn)
  - Dependent variable: time to complete jobs
23 Criticism 1: This experiment design includes no direct measure of the hypothesized effect
- Hypothesis: KOSO takes longer than KOSO* because KOSO* balances loads better
- But experiment design includes no direct measure of load balancing:
  - Independent variables: KOSO v KOSO*, no. of processors, no. of jobs, probability(job will spawn)
  - Dependent variable: time to complete jobs
24 Lesson 3: Exploratory data analysis means looking beneath immediate results for explanations
- T-test on time to complete jobs: t = ( )/587 = -.19
- KOSO* apparently no faster than KOSO (as theory predicted)
- Why? Look more closely at the data:
[Figure: time to complete jobs, KOSO vs. KOSO*]
- Outliers create excessive variance, so test isn't significant
25 Lesson 4: The task of empirical work is to explain variability
[Figure: run time as influenced by algorithm (KOSO/KOSO*), number of processors, number of jobs, and random noise (e.g., outliers)]
- Empirical work assumes the variability in a dependent variable (e.g., run time) is the sum of causal factors and random noise. Statistical methods assign parts of this variability to the factors and the noise.
- Number of processors and number of jobs explain 74% of the variance in run time. Algorithm explains almost none.
26 Lesson 3 (again): Exploratory data analysis means looking beneath immediate results for explanations
- Why does the KOSO/KOSO* choice account for so little of the variance in run time?
- Unless processors starve, there will be no effect of load balancing. In most conditions in this experiment, processors never starved. (This is why we run pilot experiments!)
27 Lesson 5: Of sample variance, effect size, and sample size, control the first before touching the last
$$ t = \frac{\bar{x} - \mu}{s / \sqrt{N}} $$
- $\bar{x} - \mu$: magnitude of effect
- $s$: background variance
- $N$: sample size
- This intimate relationship holds for all statistics
28 Lesson 5 illustrated: A variance reduction method
- Let N = num-jobs, P = num-processors, T = run time
- Then $T = k \, (N / P)$, or k multiples of the theoretical best time
- And $k = \frac{1}{N / (P T)} = \frac{P T}{N}$
[Figure: distributions of k(KOSO) and k(KOSO*)]
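
A minimal sketch of the recoding above, assuming nothing beyond the definition T = k (N / P). The function name and the sample trials are hypothetical and are not data from the study.

```python
def multiples_of_optimal(num_jobs: int, num_processors: int, run_time: float) -> float:
    """Return k = T / (N / P): run time as a multiple of the ideal time N / P."""
    ideal_time = num_jobs / num_processors
    return run_time / ideal_time


# Hypothetical trials (N, P, T); these numbers are made up for illustration.
trials = [(1000, 10, 111.0), (20000, 20, 1100.0)]
for n, p, t in trials:
    print(n, p, t, round(multiples_of_optimal(n, p, t), 3))
```

Because k divides out the part of run time that is determined by N and P, comparing algorithms on k rather than on raw T removes a large source of background variance without collecting more data.
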
29 Where are we?
- KOSO* is significantly better than KOSO when the dependent variable is recoded as percentage of optimal run time
- The difference between KOSO* and KOSO explains very little of the variance in either dependent variable
- Exploratory data analysis tells us that processors aren't starving so we shouldn't be surprised
- Prediction: The effect of algorithm on run time (or k) increases as the number of jobs increases or the number of processors increases
- This prediction is about interactions between factors
30 Lesson 6: Most interesting science is about interaction effects, not simple main effects
- Data confirm prediction
  - KOSO* is superior on larger rings where starvation is an issue
- Interaction of independent variables
  - choice of algorithm
  - number of processors
- Interaction effects are essential to explaining how things work
[Figure: multiples of optimal run time vs. number of processors, for KOSO and KOSO*]
31 Lesson 7: Significant and meaningful are not synonymous. Is a result meaningful?
- KOSO* is significantly better than KOSO, but can you use the result?
- Suppose you wanted to use the knowledge that the ring is controlled by KOSO or KOSO* for some prediction.
  - Grand median k = 1.11; Pr(trial i has k > 1.11) = .5
  - Pr(trial i under KOSO has k > 1.11) = 0.57
  - Pr(trial i under KOSO* has k > 1.11) = 0.43
- Predict for trial i whether its k is above or below the median:
  - If it's a KOSO* trial you'll say no, with (.43 * 150) = 64.5 errors
  - If it's a KOSO trial you'll say yes, with ((1 - .57) * 160) = 68.8 errors
  - If you don't know you'll make (.5 * 310) = 155 errors
- 155 - (64.5 + 68.8) = 21.7, or about 22 fewer errors
- Knowing the algorithm reduces error rate from .5 to .43. Is this enough???
32 Lesson 8: Keep the big picture in mind
- "Why are you studying this?"
- "Load balancing is important to get good performance out of parallel computers."
- "Why is this important?"
- "Parallel computing promises to tackle many of our computational bottlenecks."
- "How do we know this?"
- "It's in the first paragraph of the paper!"
33 Case study: conclusions
- Evaluation begins with claims
- Demonstrations of simple main effects are good, understanding the effects is better
- Exploratory data analysis means using your eyes to find explanatory patterns in data
- The task of empirical work is to explain variability
- Control variability before increasing sample size
- Interaction effects are essential to explanations
- Significant ≠ meaningful
- Keep the big picture in mind
Empirical Methods for CS Part III: Experiment design
35 Experimental Life Cycle
- Exploration
- Hypothesis construction
- Experiment
- Data analysis
- Drawing of conclusions
36 Checklist for experiment design*
- Consider the experimental procedure
  - making it explicit helps to identify spurious effects and sampling biases
- Consider a sample data table
  - identifies what results need to be collected
  - clarifies dependent and independent variables
  - shows whether data pertain to hypothesis
- Consider an example of the data analysis
  - helps you to avoid collecting too little or too much data
  - especially important when looking for interactions
*From Chapter 3, Empirical Methods for Artificial Intelligence, Paul Cohen, MIT Press
37 Guidelines for experiment design
- Consider possible results and their interpretation
  - may show that experiment cannot support/refute hypotheses under test
  - unforeseen outcomes may suggest new hypotheses
- What was the question again?
  - easy to get carried away designing an experiment and lose the BIG picture
- Run a pilot experiment to calibrate parameters (e.g., number of processors in Rosenberg experiment)
39 Manipulation experiment
- Independent variable, x
  - x = identity of parser, size of dictionary, …
- Dependent variable, y
  - y = accuracy, speed, …
- Hypothesis
  - x influences y
- Manipulation experiment
  - change x, record y
40 Observation experiment
- Predictor, x
  - x = volatility of stock prices, …
- Response variable, y
  - y = fund performance, …
- Hypothesis
  - x influences y
- Observation experiment
  - classify according to x, compute y
41 Factorial experiment
- Several independent variables, x_i
  - there may be no simple causal links
  - data may come that way
    - e.g. individuals will have different sexes, ages, ...
- Factorial experiment
  - every possible combination of x_i considered
  - expensive as its name suggests!
42 Designing factorial experiments
- In general, stick to 2 to 3 independent variables
- Solve same set of problems in each case
  - reduces variance due to differences between problem sets
- If this is not possible, use same sample sizes
  - simplifies statistical analysis
- As usual, default hypothesis is that no influence exists
  - much easier to fail to demonstrate influence than to demonstrate an influence
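
As a rough illustration of "every possible combination of x_i", the sketch below enumerates a full factorial design. The factor names and levels are hypothetical, loosely modeled on the scheduling case study, and are not the values used there.

```python
from itertools import product

# Hypothetical factor levels, loosely modeled on the scheduling case study.
factors = {
    "algorithm": ["KOSO", "KOSO*"],
    "num_processors": [3, 10, 20],
    "num_jobs": [1000, 10000],
}

# A full factorial design runs every combination of levels.
design = [dict(zip(factors, levels)) for levels in product(*factors.values())]

print(len(design))   # 2 * 3 * 2 = 12 conditions
print(design[0])
```

The number of conditions is the product of the number of levels per factor, which is one reason to keep to two or three independent variables.
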
43 Some problem issues
- Control
- Ceiling and Floor effects
- Sampling Biases
44 Control
- A control is an experiment in which the hypothesised variation does not occur
  - so the hypothesized effect should not occur either
- BUT remember
  - placebos cure a large percentage of patients!
45 Control: a cautionary tale
- Macaque monkeys given vaccine based on human T-cells infected with SIV (relative of HIV)
  - macaques gained immunity from SIV
- Later, macaques given uninfected human T-cells
  - and macaques still gained immunity!
- Control experiment not originally done
  - and not always obvious (you can't control for all variables)
46 Control: MYCIN case study
- MYCIN was a medical expert system
  - recommended therapy for blood/meningitis infections
- How to evaluate its recommendations?
- Shortliffe used
  - 10 sample problems, 8 therapy recommenders
    - 5 faculty, 1 resident, 1 postdoc, 1 student
  - 8 impartial judges gave 1 point per problem
  - max score was 80
  - MYCIN 65, faculty 40-60, postdoc 60, resident 45, student 30
47 Control: MYCIN case study
- What were controls?
- Control for judges' bias for/against computers
  - judges did not know who recommended each therapy
- Control for easy problems
  - medical student did badly, so problems not easy
- Control for our standard being low
  - e.g. random choice should do worse
- Control for factor of interest
  - e.g. hypothesis in MYCIN that "knowledge is power"
  - have groups with different levels of knowledge
48 Ceiling and Floor Effects
- Well designed experiments (with good controls) can still go wrong
- What if all our algorithms do particularly well?
  - Or they all do badly?
- We've got little evidence to choose between them
49 Ceiling and Floor Effects
- Ceiling effects arise when test problems are insufficiently challenging
  - floor effects the opposite, when problems too challenging
- A problem in AI because we often repeatedly use the same benchmark sets
  - most benchmarks will lose their challenge eventually?
  - but how do we detect this effect?
50 Machine learning example
- 14 datasets from UCI corpus of benchmarks
  - used as mainstay of ML community
- Problem is learning classification rules
  - each item is vector of features and a classification
  - measure classification accuracy of method (max 100%)
- Compare C4 with 1R*, two competing algorithms
Rob Holte, Machine Learning, vol. 3, pp , 1993
51 Floor effects: machine learning example
[Table: classification accuracy of C4 and 1R* on datasets BC, CH, GL, G2, HD, HE, …, with the mean]
- Is 1R* above the floor of performance?
- How would we tell?
52 Floor effects: machine learning example
[Table: classification accuracy of C4, 1R*, and Baseline on datasets BC, CH, GL, G2, HD, HE, …; the Baseline mean is 59.9]
- Baseline rule puts all items in more popular category.
- 1R* is above baseline on most datasets
- A bit like the prime number joke?
  - 1 is prime. 3 is prime. 5 is prime.
  - So, baseline rule is that all odd numbers are prime.
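
To make the baseline rule concrete, here is a small sketch that computes the accuracy of always predicting the more popular category. The labels are hypothetical and do not come from the UCI datasets, and `baseline_accuracy` is an illustrative helper name.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class in `labels`."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical class labels for one dataset (not taken from the UCI corpus).
labels = ["benign"] * 7 + ["malignant"] * 3
print(baseline_accuracy(labels))  # 0.7
```

A learner whose accuracy is not clearly above this number has not shown that it is off the floor.
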
53 Ceiling Effects: machine learning
[Table: classification accuracy of C4 and 1R* on datasets BC, GL, HY, LY, MU, …, with the mean]
- How do we know that C4 and 1R* are not near the ceiling of performance?
- Do the datasets have enough attributes to make perfect classification?
  - Obviously for MU, but what about the rest?
54 Ceiling Effects: machine learning
[Table: accuracy of C4, 1R*, max(C4, 1R*), and max([Buntine]) on datasets BC, GL, HY, LY, MU, …; the mean of max(C4, 1R*) is 87.4 and the mean of max([Buntine]) is 82.0]
- C4 achieves only about 2% better than 1R*
  - Best of the C4/1R* achieves 87.4% accuracy
- We have only weak evidence that C4 is better
- Both methods appear to be performing near the ceiling of what is possible, so comparison is hard!
55 Ceiling Effects: machine learning
- In fact 1R* only uses one feature (the best one)
- C4 uses on average 6.6 features
- 5.6 features buy only about 2% improvement
- Conclusion?
  - Either real world learning problems are easy (use 1R*)
  - Or we need more challenging datasets
  - We need to be aware of ceiling effects in results
56 Sampling bias
- Data collection is biased against certain data
  - e.g. teacher who says "Girls don't answer maths questions"
  - observation might suggest:
    - girls don't answer many questions
    - but also that the teacher doesn't ask them many questions
- Experienced AI researchers don't do that, right?
57 Sampling bias: Phoenix case study
- AI system to fight (simulated) forest fires
- Experiments suggest that wind speed uncorrelated with time to put out fire
  - obviously incorrect as high winds spread forest fires
58 Sampling bias: Phoenix case study
- Wind speed vs. containment time (max 150 hours):
[Table: containment times at each wind speed (3, …)]
- What's the problem?
59 Sampling bias: Phoenix case study
- The cut-off of 150 hours introduces sampling bias
  - many high-wind fires get cut off, not many low wind
- On remaining data, there is no correlation between wind speed and time (r = -0.53)
- In fact, data shows that:
  - a lot of high wind fires take > 150 hours to contain
  - those that don't are similar to low wind fires
- You wouldn't do this, right?
  - you might if you had automated data analysis.
60 Sampling biases can be subtle...
- Assume gender (G) is an independent variable and number of siblings (S) is a noise variable.
- If S is truly a noise variable then under random sampling, no dependency should exist between G and S in samples.
- Parents have children until they get at least one boy. They don't feel the same way about girls. In a sample of 1000 girls the number with S = 0 is smaller than in a sample of 1000 boys.
- The frequency distribution of S is different for different genders. S and G are not independent.
- Girls do better at math than boys in random samples at all levels of education.
- Is this because of their genes or because they have more siblings?
- What else might be systematically associated with G that we don't know about?
Empirical Methods for CS Part IV: Data analysis
62 Kinds of data analysis
- Exploratory (EDA): looking for patterns in data
- Statistical inferences from sample data
  - Testing hypotheses
  - Estimating parameters
- Building mathematical models of datasets
- Machine learning, data mining…
- We will introduce hypothesis testing and computer-intensive methods
63 The logic of hypothesis testing
- Example: toss a coin ten times, observe eight heads. Is the coin fair (i.e., what is its long run behavior?) and what is your residual uncertainty?
- You say, "If the coin were fair, then eight or more heads is pretty unlikely, so I think the coin isn't fair."
- Like proof by contradiction: assert the opposite (the coin is fair), show that the sample result (≥ 8 heads) has low probability p, reject the assertion, with residual uncertainty related to p.
- Estimate p with a sampling distribution.
64 Probability of a sample result under a null hypothesis
- If the coin were fair (p = .5, the null hypothesis) what is the probability distribution of r, the number of heads, obtained in N tosses of a fair coin? Get it analytically or estimate it by simulation (on a computer):
  Loop K times
    r := 0                               ;; r is num. heads in N tosses
    Loop N times                         ;; simulate the tosses
      Generate a random 0 ≤ x ≤ 1.0
      If x < p increment r               ;; p is the probability of a head
    Push r onto sampling_distribution
  Print sampling_distribution
65 Sampling distributions
- This is the estimated sampling distribution of r under the null hypothesis that p = .5. The estimate is constructed by Monte Carlo sampling.
[Figure: frequency (K = 1000) vs. number of heads in 10 tosses]
- Probability of r = 8 or more heads in N = 10 tosses of a fair coin is 54 / 1000 = .054
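
The coin-tossing simulation can be sketched in Python as follows. This is one possible rendering, not the authors' code: the function name is an assumption, the seed is fixed only for repeatability, and it uses K = 10000 trials rather than 1000 to get a steadier estimate.

```python
import random
from collections import Counter


def simulate_heads(num_tosses, p_head, num_trials, rng):
    """Return num_trials simulated counts of heads in num_tosses tosses."""
    return [
        sum(1 for _ in range(num_tosses) if rng.random() < p_head)
        for _ in range(num_trials)
    ]


rng = random.Random(0)  # fixed seed so the run can be repeated
sampling_distribution = simulate_heads(num_tosses=10, p_head=0.5, num_trials=10000, rng=rng)

# Estimated probability of 8 or more heads under the null hypothesis.
p_value = sum(r >= 8 for r in sampling_distribution) / len(sampling_distribution)
print(sorted(Counter(sampling_distribution).items()))
print(p_value)
```

The estimate should land near the .054 reported for K = 1000, with the usual Monte Carlo wobble from run to run.
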
66 The logic of hypothesis testing
- Establish a null hypothesis: H0: p = .5, the coin is fair
- Establish a statistic: r, the number of heads in N tosses
- Figure out the sampling distribution of r given H0
- The sampling distribution will tell you the probability p of a result at least as extreme as your sample result, r = 8
- If this probability is very low, reject H0, the null hypothesis
- Residual uncertainty is p
67 The only tricky part is getting the sampling distribution
- Sampling distributions can be derived...
  - Exactly, e.g., binomial probabilities for coins are given by the formula $\Pr(r) = \binom{N}{r} p^r (1 - p)^{N - r}$
  - Analytically, e.g., the central limit theorem tells us that the sampling distribution of the mean approaches a Normal distribution as samples grow to infinity
  - Estimated by Monte Carlo simulation of the null hypothesis process
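
For the coin example the exact route is short enough to write out. The sketch below sums binomial probabilities for 8, 9, and 10 heads; `binomial_tail` is an illustrative name, not something from the tutorial.

```python
from math import comb


def binomial_tail(n, r_min, p):
    """Exact Pr(r >= r_min) for r heads in n tosses with head probability p."""
    return sum(comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(r_min, n + 1))


print(binomial_tail(10, 8, 0.5))  # 56/1024 = 0.0546875
```

The exact value, about .055, sits close to a Monte Carlo estimate of .054.
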
68 A common statistical test: The Z test for different means
- A sample of N = 25 computer science students has mean IQ m = 135. Are they smarter than average?
- Population mean is 100 with standard deviation 15
- The null hypothesis, H0, is that the CS students are average, i.e., the mean IQ of the population of CS students is 100.
- What is the probability p of drawing the sample if H0 were true? If p is small, then H0 is probably false.
- Find the sampling distribution of the mean of a sample of size 25, from population with mean 100
69 Central Limit Theorem: The sampling distribution of the mean is given by the Central Limit Theorem
- The sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity.
- If the samples are drawn from a population with mean $\mu$ and standard deviation $\sigma$, then the mean of the sampling distribution is $\mu$ and its standard deviation is $\sigma_{\bar{x}} = \sigma / \sqrt{N}$ as N increases.
- These statements hold irrespective of the shape of the original distribution.
70 The sampling distribution for the CS student example
- If a sample of N = 25 students were drawn from a population with mean 100 and standard deviation 15 (the null hypothesis) then the sampling distribution of the mean would asymptotically be normal with mean 100 and standard deviation $15 / \sqrt{25} = 3$
- The mean of the CS students falls nearly 12 standard deviations away from the mean of the sampling distribution
- Only about 5% of a normal distribution falls more than two standard deviations away from the mean
- The probability that the students are average is roughly zero
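71 The Z test
[Figure: sampling distribution of the mean (std = 3) with the sample statistic marked, recoded as the distribution of the test statistic (std = 1.0)]
$$ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}} = \frac{135 - 100}{15 / \sqrt{25}} = \frac{35}{3} = 11.67 $$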
72 Reject the null hypothesis?
- Commonly we reject H0 when the probability of obtaining a sample statistic (e.g., mean = 135) given the null hypothesis is low, say < .05.
- A test statistic value, e.g. Z = 11.67, recodes the sample statistic (mean = 135) to make it easy to find the probability of the sample statistic given H0.
- We find the probabilities by looking them up in tables, or statistics packages provide them.
  - For example, Pr(Z ≥ 1.67) ≈ .05; Pr(Z ≥ 1.96) ≈ .025.
- Pr(Z ≥ 11) is approximately zero, reject H0.
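
A minimal sketch of the Z test for the CS student example, using only the standard library. The function name is an assumption; the upper-tail probability of a standard normal is computed from the complementary error function instead of a table.

```python
from math import erfc, sqrt


def z_test_upper(sample_mean, pop_mean, pop_std, n):
    """Return (Z, one-sided p) for H0: the sample comes from the population."""
    std_error = pop_std / sqrt(n)            # std of the sampling distribution
    z = (sample_mean - pop_mean) / std_error
    p = 0.5 * erfc(z / sqrt(2))              # Pr(Z >= z) for a standard normal
    return z, p


z, p = z_test_upper(sample_mean=135, pop_mean=100, pop_std=15, n=25)
print(round(z, 2), p)
```

With these inputs Z is 35 / 3, and the tail probability is far too small to appear in any printed table.
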
73 The t test
- Same logic as the Z test, but appropriate when population standard deviation is unknown, samples are small, etc.
- Sampling distribution is t, not normal, but approaches normal as sample size increases
- Test statistic has very similar form but probabilities of the test statistic are obtained by consulting tables of the t distribution, not the normal
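74 The t test
- Suppose N = 5 students have mean IQ = 135, std = 27
- Estimate the standard deviation of sampling distribution using the sample standard deviation
[Figure: sampling distribution of the mean with the sample statistic marked, recoded as the distribution of the test statistic (std = 1.0)]
$$ t = \frac{\bar{x} - \mu}{s / \sqrt{N}} = \frac{135 - 100}{27 / \sqrt{5}} = \frac{35}{12.1} \approx 2.89 $$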
75 Summary of hypothesis testing
- H0 negates what you want to demonstrate; find probability p of sample statistic under H0 by comparing test statistic to sampling distribution; if probability is low, reject H0 with residual uncertainty proportional to p.
- Example: Want to demonstrate that CS graduate students are smarter than average. H0 is that they are average. t = 2.89, p ≈ .022
- Have we proved CS students are smarter? NO!
- We have only shown that mean = 135 is unlikely if they aren't. We never prove what we want to demonstrate, we only reject H0, with residual uncertainty.
- And failing to reject H0 does not prove H0, either!
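
The t statistic and its one-sided p value for the five-student example can be checked with the sketch below. It is an illustration under stated assumptions rather than a replacement for a statistics package: the function names are hypothetical, and the tail area of the t distribution is obtained by Simpson's rule on the density instead of a table lookup.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def t_test_upper(sample_mean, pop_mean, sample_std, n, steps=2000):
    """Return (t, one-sided p); the tail area uses Simpson's rule on [0, t]."""
    t = (sample_mean - pop_mean) / (sample_std / sqrt(n))
    df = n - 1
    h = t / steps                     # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return t, 0.5 - area              # valid for t >= 0


t, p = t_test_upper(sample_mean=135, pop_mean=100, sample_std=27, n=5)
print(round(t, 2), round(p, 3))
```

Computed without intermediate rounding, t comes out near 2.90; the 2.89 on the slide presumably reflects rounding the standard error to 12.1. The one-sided p value agrees with the quoted .022.
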
76 Common tests
- Tests that means are equal
- Tests that samples are uncorrelated or independent
- Tests that slopes of lines are equal
- Tests that predictors in rules have predictive power
- Tests that frequency distributions (how often events happen) are equal
- Tests that classification variables such as smoking history and heart disease history are unrelated
- ...
- All follow the same basic logic
77 Computer-intensive Methods
- Basic idea: Construct sampling distributions by simulating on a computer the process of drawing samples.
- Three main methods:
  - Monte Carlo simulation when one knows population parameters;
  - Bootstrap when one doesn't;
  - Randomization, also assumes nothing about the population.
- Enormous advantage: Works for any statistic and makes no strong parametric assumptions (e.g., normality)
78 Another Monte Carlo example, relevant to machine learning...
- Suppose you want to buy stocks in a mutual fund; for simplicity assume there are just N = 50 funds to choose from and you'll base your decision on the proportion of J = 30 stocks in each fund that increased in value
- Suppose Pr(a stock increasing in price) = .75
- You are tempted by the best of the funds, F, which reports price increases in 28 of its 30 stocks.
- What is the probability of this performance?
79 Simulate...
  Loop K = 1000 times
    B := 0                            ;; number of stocks that increase in the best of N funds
    Loop N = 50 times                 ;; N is number of funds
      H := 0                          ;; stocks that increase in this fund
      Loop J = 30 times               ;; J is number of stocks in this fund
        Toss a coin with bias p to decide whether this stock
        increases in value, and if so increment H
      Push H on a list                ;; we get N values of H
    B := maximum of the list of H     ;; the number of increasing stocks in the best fund
    Push B on a list                  ;; we get K values of B
80 Surprise!
- The probability that the best of 50 funds reports 28 of 30 stocks increase in price is roughly 0.4
- Why? The probability that an arbitrary fund would report this increase is Pr(28 successes | Pr(success) = .75) ≈ .01, but the probability that the best of 50 funds would report this is much higher.
- Machine learning algorithms use critical values based on arbitrary elements, when they are actually testing the best element; they think elements are more unusual than they really are. This is why ML algorithms overfit.
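
The best-of-50 effect can be checked two ways: exactly, from the binomial tail, and by simulation along the lines of the pseudocode. The sketch below does both. The function names are hypothetical, the funds are assumed independent, "28 of 30" is read as "at least 28", and the simulation uses 2000 trials with a fixed seed.

```python
import random
from math import comb


def prob_single_fund(num_stocks, threshold, p_up):
    """Exact Pr(at least `threshold` of num_stocks rise) for one arbitrary fund."""
    return sum(
        comb(num_stocks, k) * p_up**k * (1 - p_up) ** (num_stocks - k)
        for k in range(threshold, num_stocks + 1)
    )


def prob_best_fund_simulated(num_funds, num_stocks, threshold, p_up, trials, rng):
    """Monte Carlo estimate of Pr(best of num_funds reaches `threshold`)."""
    hits = 0
    for _ in range(trials):
        best = max(
            sum(rng.random() < p_up for _ in range(num_stocks))
            for _ in range(num_funds)
        )
        hits += best >= threshold
    return hits / trials


single = prob_single_fund(30, 28, 0.75)
best_exact = 1 - (1 - single) ** 50          # funds assumed independent
best_simulated = prob_best_fund_simulated(50, 30, 28, 0.75, 2000, random.Random(0))
print(round(single, 3), round(best_exact, 2), best_simulated)
```

A single fund reaches 28 of 30 about 1% of the time, but the chance that at least one of 50 independent funds does so is about 0.41, in line with the "roughly 0.4" above.
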
81 The Bootstrap
- Monte Carlo estimation of sampling distributions assumes you know the parameters of the population from which samples are drawn.
- What if you don't?
- Use the sample as an estimate of the population.
- Draw samples from the sample!
- With or without replacement?
- Example: Sampling distribution of the mean; check the results against the central limit theorem.
82 Bootstrapping the sampling distribution of the mean*
- S is a sample of size N:
  Loop K = 1000 times
    Draw a pseudosample S* of size N from S by sampling with replacement
    Calculate the mean of S* and push it on a list L
- L is the bootstrapped sampling distribution of the mean**
- This procedure works for any statistic, not just the mean.
* Recall we can get the sampling distribution of the mean via the central limit theorem; this example is just for illustration.
** This distribution is not a null hypothesis distribution and so is not directly used for hypothesis testing, but can easily be transformed into a null hypothesis distribution (see Cohen, 1995).
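
The bootstrap loop translates almost line for line into Python. In this sketch the sample itself is hypothetical (25 values generated from a normal distribution purely for illustration) and `bootstrap_means` is an assumed name; the final lines compare the bootstrap spread with the central limit theorem estimate s / sqrt(N).

```python
import random
from statistics import mean, stdev


def bootstrap_means(sample, num_resamples, rng):
    """Means of pseudosamples drawn from `sample` with replacement."""
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(num_resamples)]


rng = random.Random(0)
# Hypothetical sample of 25 scores, generated here only for illustration.
sample = [rng.gauss(100, 15) for _ in range(25)]

boot = bootstrap_means(sample, num_resamples=1000, rng=rng)
print(mean(sample), mean(boot))                          # centers should be close
print(stdev(sample) / len(sample) ** 0.5, stdev(boot))   # CLT estimate vs bootstrap
```

Swapping `mean` for another statistic (median, trimmed mean, a ratio) inside `bootstrap_means` is all it takes to bootstrap that statistic instead.
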
83 Randomization
- Used to test hypotheses that involve association between elements of two or more groups; very general.
- Example: Paul tosses H H H H, Carole tosses T T T T. Is outcome independent of tosser?
- Example: 4 women score , six men score . Is score independent of gender?
- Basic procedure: Calculate a statistic f for your sample; randomize one factor relative to the other and calculate your pseudostatistic f*. Compare f to the sampling distribution for f*.
84 Example of randomization
- Four women score , six men score . Is score independent of gender?
- f = difference of means of men's and women's scores:
- Under the null hypothesis of no association between gender and score, the score 54 might equally well have been achieved by a male or a female.
- Toss all scores in a hopper, draw out four at random and without replacement, call them female*, call the rest male*, and calculate f*, the difference of means of female* and male*. Repeat to get a distribution of f*. This is an estimate of the sampling distribution of f under H0: no difference between male and female scores.
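
A sketch of the hopper procedure follows. The scores below are hypothetical stand-ins, since the actual numbers are not reproduced in this transcript, and the function name is an assumption. Shuffling the pooled scores and splitting them is equivalent to drawing four at random without replacement.

```python
import random
from statistics import mean


def randomization_test(group_a, group_b, num_shuffles, rng):
    """Estimate Pr(|f*| >= |f|) where f is the difference of group means."""
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)                      # reassign scores to groups at random
        pseudo_a = pooled[: len(group_a)]
        pseudo_b = pooled[len(group_a):]
        if abs(mean(pseudo_a) - mean(pseudo_b)) >= abs(observed):
            extreme += 1
    return observed, extreme / num_shuffles


# Hypothetical scores; the slide's actual numbers are not reproduced here.
women = [54, 66, 64, 61]
men = [23, 28, 27, 31, 51, 32]
observed, p = randomization_test(women, men, num_shuffles=5000, rng=random.Random(0))
print(observed, p)
```

With only ten scores there are just 210 ways to choose the four "female*" scores, so the distribution of f* could also be enumerated exactly; random shuffling is shown because it scales to larger samples.
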
Empirical Methods for CS Part V: How Not To Do It
86 Tales from the coal face
- Those ignorant of history are doomed to repeat it
  - we have committed many howlers
- We hope to help others avoid similar ones …
  - … and illustrate how easy it is to screw up!
- "How Not to Do It", I Gent, S A Grant, E. MacIntyre, P Prosser, P Shaw, B M Smith, and T Walsh, University of Leeds Research Report, May 1997
- Every howler we report was committed by at least one of the above authors!
87 How Not to Do It
- Do measure with many instruments
  - in exploring hard problems, we used our best algorithms
  - missed very poor performance of less good algorithms
    - better algorithms will be bitten by same effect on larger instances than we considered
- Do measure CPU time
  - in exploratory code, CPU time often misleading
  - but can also be very informative
    - e.g. heuristic needed more search but was faster
88 How Not to Do It
- Do vary all relevant factors
- Don't change two things at once
  - ascribed effects of heuristic to the algorithm
    - changed heuristic and algorithm at the same time
    - didn't perform factorial experiment
  - but it's not always easy/possible to do the right experiments if there are many factors
89 How Not to Do It
- Do Collect All Data Possible … (within reason)
  - one year Santa Claus had to repeat all our experiments
    - ECAI/AAAI/IJCAI deadlines just after new year!
  - we had collected number of branches in search tree
    - performance scaled with backtracks, not branches
    - all experiments had to be rerun
- Don't Kill Your Machines
  - we have got into trouble with sysadmins
    - … over experimental data we never used
  - often the vital experiment is small and quick
90 How Not to Do It
- Do It All Again … (or at least be able to)
  - e.g. storing random seeds used in experiments
  - we didn't do that and might have lost important result
- Do Be Paranoid
  - identical implementations in C, Scheme gave different results
- Do Use The Same Problems
  - reproducibility is a key to science (c.f. cold fusion)
  - can reduce variance
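
One simple way to "be able to do it all again" is to log the seed with every result. The sketch below assumes a stand-in experiment (`run_experiment` just draws pseudo-random numbers) and a JSON log line; both are illustrative choices, not the authors' setup.

```python
import json
import random


def run_experiment(seed, num_problems=5):
    """Stand-in experiment: returns pseudo-random 'run times' for a given seed."""
    rng = random.Random(seed)          # private generator, independent of global state
    return [round(rng.uniform(0.0, 100.0), 3) for _ in range(num_problems)]


seed = 12345                            # in practice, pick a fresh seed per run and log it
record = {"seed": seed, "results": run_experiment(seed)}
log_line = json.dumps(record)           # store this alongside the experimental data
print(log_line)

# Re-running with the logged seed reproduces the same results.
replayed = run_experiment(json.loads(log_line)["seed"])
print(replayed == record["results"])
```

Using a private `random.Random(seed)` instance, rather than the module-level generator, keeps the run repeatable even if other code also draws random numbers.
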
91 Choosing your test data
- We've seen the possible problem of over-fitting
  - remember machine learning benchmarks?
- Two common approaches
  - benchmark libraries
  - random problems
- Both have potential pitfalls
92 Benchmark libraries
- +ve
  - can be based on real problems
  - lots of structure
- -ve
  - library of fixed size
    - possible to over-fit algorithms to library
  - problems have fixed size
    - so can't measure scaling
93 Random problems
- +ve
  - problems can have any size
    - so can measure scaling
  - can generate any number of problems
    - hard to over-fit?
- -ve
  - may not be representative of real problems
    - lack structure
  - easy to generate flawed problems
    - CSP, QSAT, …
94 Flawed random problems zConstraint satisfaction example y40+ papers over 5 years by many authors used Models A, B, C, and D yall four models are flawed [Achlioptas et al. 1997] xasymptotically almost all problems are trivial xbrings into doubt many experimental results xsome experiments at typical sizes are affected xfortunately not many yHow should we generate problems in future?
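To make the flaw concrete, here is a rough sketch of a Model B style generator together with a check for flawed variables. It rests on assumptions that should be checked against the cited papers: Model B is taken to choose exactly a fraction p1 of the variable pairs as constraints and exactly a fraction p2 of the m*m value pairs as nogoods in each constraint, a value is taken to be flawed if some constraint forbids it with every value of the other variable, and a variable is taken to be flawed if all of its values are. The function names are invented for this example.

```python
import itertools
import random


def model_b(n, m, p1, p2, seed):
    """Random binary CSP: maps each constrained pair (x, y) to its nogoods."""
    rng = random.Random(seed)
    all_scopes = list(itertools.combinations(range(n), 2))
    value_pairs = list(itertools.product(range(m), repeat=2))
    n_constraints = round(p1 * len(all_scopes))
    n_nogoods = round(p2 * m * m)
    return {
        scope: set(rng.sample(value_pairs, n_nogoods))
        for scope in rng.sample(all_scopes, n_constraints)
    }


def flawed_variables(n, m, constraints):
    """Return variables whose every value is ruled out by a single constraint."""
    flawed_values = {v: set() for v in range(n)}
    for (x, y), nogoods in constraints.items():
        for a in range(m):
            if all((a, b) in nogoods for b in range(m)):
                flawed_values[x].add(a)  # x=a conflicts with every value of y
            if all((b, a) in nogoods for b in range(m)):
                flawed_values[y].add(a)  # y=a conflicts with every value of x
    return [v for v in range(n) if len(flawed_values[v]) == m]


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        hits = sum(
            bool(flawed_variables(n, 3, model_b(n, 3, 0.5, 0.6, seed)))
            for seed in range(50)
        )
        print(f"n={n:3d}: {hits}/50 instances contain a flawed variable")
```

An instance with a flawed variable is trivially insoluble, since that variable has no value that survives even a single constraint. Counting such instances as n grows gives a quick sanity check on a generator before any expensive experiments are run.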
95 Flawed random problems z[Gent et al. 1998] fix flaw … yintroduce flawless problem generation ydefined in two equivalent ways ythough no proof that problems are truly flawless zAn undergraduate student at Strathclyde found a new bug ythe two definitions of flawless were not equivalent zEventually settled on final definition of flawless ygave proof of asymptotic non-triviality yso we think that we just about understand the problem generator now
96 Prototyping your algorithm zOften need to implement an algorithm yusually novel algorithm, or variant of existing one xe.g. new heuristic in existing search algorithm ynovelty of algorithm should imply extra care ymore often, encourages lax implementation x"it's only a preliminary version"
97 How Not to Do It zDon't Trust Yourself ybug in innermost loop found by chance yall experiments re-run with urgent deadline ycuriously, sometimes bugged version was better! zDo Preserve Your Code yor end up fixing the same error twice xDo use version control!
98 How Not to Do It zDo Make it Fast Enough yemphasis on enough xit's often not necessary to have optimal code xin lifecycle of experiment, extra coding time not won back ye.g. we have published many papers with inefficient code xcompared to state of the art xfirst GSAT version was O(N^2), but this really was too slow! zDo Report Important Implementation Details yintermediate versions produced good results
99 How Not to Do It zDo Look at the Raw Data ySummaries obscure important aspects of behaviour yMany statistical measures explicitly designed to minimise effect of outliers ySometimes outliers are vital xexceptionally hard problems dominate mean xwe missed them until they hit us on the head, when experiments crashed overnight xold data on smaller problems showed clear behaviour
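The point that exceptionally hard problems dominate the mean while a robust summary hides them can be seen with a few lines of arithmetic. The data below is synthetic and invented purely for illustration: 99 easy instances and one very hard one.

```python
import statistics


def summarize(costs):
    """Summary statistics that expose, rather than hide, the largest value."""
    ordered = sorted(costs)
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "max_share_of_total": ordered[-1] / sum(ordered),
    }


if __name__ == "__main__":
    # Synthetic search costs: 99 instances cost 100, one costs 1,000,000.
    costs = [100] * 99 + [1_000_000]
    for name, value in summarize(costs).items():
        print(f"{name:20s} {value}")
```

Here the median is 100 and reports nothing unusual, the mean is 10099, and a single instance accounts for about 99% of the total cost. Looking only at the median would miss exactly the instance that makes the experiment crash overnight.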
100 How Not to Do It zDo face up to the consequences of your results ye.g. preprocessing on 450 problems xshould obviously reduce search xreduced search 448 times xincreased search 2 times yForget the algorithm, it's useless? yOr study in detail the two exceptional cases xand achieve new understanding of an important algorithm
Empirical Methods for CS Part VII : Coda
102 Our objectives zOutline some of the basic issues yexploration, experimental design, data analysis,... zEncourage you to consider some of the pitfalls ywe have fallen into all of them! zRaise standards yencouraging debate yidentifying best practice zLearn from your experiences yexperimenters get better as they get older!
103 Summary zEmpirical CS and AI are exacting sciences yThere are many ways to do experiments wrong zWe are experts in doing experiments badly zAs you perform experiments, you'll make many mistakes zLearn from those mistakes, and ours!
Empirical Methods for CS Part VII : Supplement
105 Some expert advice zBernard Moret, U. New Mexico yTowards a Discipline of Experimental Algorithmics zDavid Johnson, AT&T Labs yA Theoretician's Guide to the Experimental Analysis of Algorithms zBoth linked to from
106 Bernard Moret's guidelines zUseful types of empirical results: yaccuracy/correctness of theoretical results yreal-world performance yheuristic quality yimpact of data structures y...
107 Bernard Moret's guidelines zHallmarks of a good experimental paper yclearly defined goals ylarge scale tests xboth in number and size of instances ymixture of problems xreal-world, random, standard benchmarks, ... ystatistical analysis of results yreproducibility xpublicly available instances, code, data files, ...
108 Bernard Moret's guidelines zPitfalls for experimental papers ysimpler experiment would have given same result yresult predictable by (back of the envelope) calculation ybad experimental setup xe.g. insufficient sample size, no consideration of scaling, … ypoor presentation of data xe.g. lack of statistics, discarding of outliers, ...
109 Bernard Moret's guidelines zIdeal experimental procedure ydefine clear set of objectives xwhich questions are you asking? ydesign experiments to meet these objectives ycollect data xdo not change experiments until all data is collected, to prevent drift/bias yanalyse data xconsider new experiments in light of these results
110 David Johnson's guidelines z3 types of paper describe the implementation of an algorithm yapplication paper x"Here's a good algorithm for this problem" ysales-pitch paper x"Here's an interesting new algorithm" yexperimental paper x"Here's how this algorithm behaves in practice" zThese lessons apply to all 3
111 David Johnson's guidelines zPerform newsworthy experiments ystandards higher than for theoretical papers! yrun experiments on real problems xtheoreticians can get away with idealized distributions but experimentalists have no excuse! ydon't use algorithms that theory can already dismiss ylook for generality and relevance xdon't just report "algorithm A dominates algorithm B", identify why it does!
112 David Johnson's guidelines zPlace work in context ycompare against previous work in literature yideally, obtain their code and test sets xverify their results, and compare with your new algorithm yless ideally, re-implement their code xreport any differences in performance yleast ideally, simply report their old results xtry to make some ball-park comparisons of machine speeds
113 David Johnson's guidelines zUse efficient implementations ysomewhat controversial yefficient implementation supports claims of practicality xtells us what is achievable in practice ycan run more experiments on larger instances xcan do our research quicker! ydon't have to go overboard on this yexceptions can also be made xe.g. not studying CPU time, comparing against a previously newsworthy algorithm, programming time more valuable than processing time, ...
114 David Johnson's guidelines zUse testbeds that support general conclusions yideally one (or more) random class, and real-world instances xpredict performance on real-world problems based on random class, evaluate quality of predictions ystructured random generators xparameters to control structure as well as size ydon't just study real-world instances xhard to justify generality unless you have a very broad class of real-world problems!
115 David Johnson's guidelines zProvide explanations and back them up with experiment yadds to credibility of experimental results yimproves our understanding of algorithms xleading to better theory and algorithms ycan weed out bugs in your implementation!
116 David Johnson's guidelines zEnsure reproducibility yeasily achieved via the Web yadds support to a paper if others can (and do) reproduce the results yrequires you to use large samples and wide range of problems xotherwise results will not be reproducible!
117 David Johnson's guidelines zEnsure comparability (and give the full picture) ymake it easy for those who come after to reproduce your results yprovide meaningful summaries xgive sample sizes, report standard deviations, plot graphs but report data in tables in the appendix ydo not hide anomalous results yreport running times even if this is not the main focus xreaders may want to know before studying your results in detail
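A summary that gives sample sizes and standard deviations, as this slide asks, might be produced along the following lines. This is only a sketch: the `(problem_size, cost)` input format and the function name are assumptions made for the example, and the minimum and maximum are included so that anomalous runs stay visible in the table.

```python
import statistics
from collections import defaultdict


def summarize_by_size(runs):
    """Group (problem_size, cost) pairs and summarise each group."""
    groups = defaultdict(list)
    for size, cost in runs:
        groups[size].append(cost)
    rows = []
    for size in sorted(groups):
        costs = groups[size]
        rows.append({
            "size": size,
            "n": len(costs),  # always report the sample size
            "mean": statistics.mean(costs),
            # the sample standard deviation needs at least two points
            "stdev": statistics.stdev(costs) if len(costs) > 1 else float("nan"),
            "min": min(costs),
            "max": max(costs),
        })
    return rows


if __name__ == "__main__":
    # Invented measurements, for illustration only.
    runs = [(10, 12), (10, 15), (10, 11), (20, 140), (20, 95), (20, 3100)]
    print(f"{'size':>5} {'n':>3} {'mean':>9} {'stdev':>9} {'min':>6} {'max':>6}")
    for row in summarize_by_size(runs):
        print(f"{row['size']:>5} {row['n']:>3} {row['mean']:>9.1f} "
              f"{row['stdev']:>9.1f} {row['min']:>6} {row['max']:>6}")
```

A table like this complements graphs rather than replacing them, and the raw per-run data can still go in an appendix.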
118 David Johnson's pitfalls zFailing to report key implementation details zExtrapolating from tiny samples zUsing irreproducible benchmarks zUsing running time as a stopping criterion zIgnoring hidden costs (e.g. preprocessing) zMisusing statistical tools zFailing to use graphs
119 David Johnson's pitfalls zObscuring raw data by using hard-to-read charts zComparing apples and oranges zDrawing conclusions not supported by the data zLeaving obvious anomalies unnoted/unexplained zFailing to back up explanations with further experiments zIgnoring the literature ythe self-referential study!