Artificial Intelligence
Ian Gent firstname.lastname@example.org
Empirical Evaluation of AI Systems, 2

Exploratory Data Analysis, or … How NOT To Do It
3. Tales from the Coal Face
- Those ignorant of history are doomed to repeat it
- We have committed many howlers in experiments
- We hope to help others avoid similar ones …
  - … and illustrate how easy it is to screw up!
- “How Not To Do It”
  - I Gent, S A Grant, E. MacIntyre, P Prosser, P Shaw, B M Smith, and T Walsh
  - University of Leeds Research Report, May 1997
  - Every howler we report was committed by at least one of the above authors!
4. Experimental Life Cycle
- Getting Started
- Exploratory Data Analysis
- Problems with Benchmark Problems
- Analysis of Data
- Presenting Results
5. Getting Started
- To get started, you need your experimental algorithm
  - usually a novel algorithm, or a variant of an existing one
    - e.g. a new heuristic in an existing search algorithm
  - novelty of the algorithm should imply extra care
  - more often, it encourages lax implementation
    - “it’s only a preliminary version”
- Don’t Trust Yourself
  - bug in innermost loop found by chance
  - all experiments rerun with an urgent deadline
  - curiously, sometimes the bugged version was better!
6. Getting Started
- Do Make it Fast Enough
  - emphasis on enough
    - it’s often not necessary to have optimal code
    - in the lifecycle of an experiment, extra coding time is not won back
  - e.g. published papers with inefficient code
    - compared to state of the art
    - first version O(N^2): too slow!
- Do Report Important Implementation Details
  - Intermediate versions produced good results
- Do Preserve Your Code
  - Or end up fixing the same error twice
  - (Do use version control!)
7. Exploratory Data Analysis
- Exploratory data analysis involves exploration
  - exploration of your results will suggest hypotheses
    - that more formal experiments later can confirm/refute
  - To suggest hypotheses you need the data in the first place
- Do measure with many instruments
  - In exploring hard problems we used our best algorithms
  - missed important effects in worse algorithms
    - and these might affect best algorithms on larger instances
8. Exploratory Data Analysis
- Do vary all relevant factors
- Don’t change two things at once
  - Ascribed effects of heuristic to the algorithm
    - changed heuristic and algorithm at the same time
    - didn’t perform factorial experiment
  - But it’s not always easy/possible to do the “right” experiments if there are many factors
- Do measure CPU time
  - In exploratory code, CPU time often misleading
    - but can also be very informative
    - e.g. heuristic needed more search but was faster
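
A factorial experiment of the kind this slide mentions can be sketched as follows. This is only an illustration, not the authors' setup: the factor names and levels in `FACTORS` are hypothetical. The point is that every algorithm is paired with every heuristic, so an observed effect can be attributed to one factor rather than to both changing at once.

```python
from itertools import product

# Hypothetical factors and levels, chosen only for illustration.
FACTORS = {
    "algorithm": ["backtracking", "forward_checking"],
    "heuristic": ["static_order", "smallest_domain_first"],
}

def factorial_design(factors):
    """Return one configuration per combination of factor levels."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, levels)) for levels in product(*level_lists)]

def differing_factors(a, b):
    """List the factors on which two configurations differ."""
    return [name for name in a if a[name] != b[name]]

configs = factorial_design(FACTORS)
for config in configs:
    print(config)
```

Comparing two runs is only safe for attributing an effect when `differing_factors` returns a single name. With many factors the full design grows multiplicatively, which is the practical difficulty the slide points out.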
9. Exploratory Data Analysis
- Do Collect All Data Possible … (within reason)
  - One year Santa Claus had to repeat all our experiments
    - paper deadline just after new year!
  - We had collected number of branches in search tree
    - but not the number of backtracks
    - performance scaled with backtracks, not branches
    - all experiments had to be rerun
- Don’t Kill Your Machines
  - We have got into trouble with sysadmins
    - … over experimental data we never used
  - Often the vital experiment is small and quick
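
One cheap way to follow the "collect all data" advice is to write every counter the solver already maintains to a per-run record, even those that seem irrelevant at the time. The sketch below assumes a hypothetical `RunRecord` layout; the field names and the sample values are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class RunRecord:
    """One row per solver run; the field names are illustrative assumptions."""
    instance: str
    seed: int
    branches: int
    backtracks: int
    cpu_seconds: float

def write_records(records, stream):
    """Write all records as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(RunRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))

# Made-up values, standing in for the output of a real run.
records = [RunRecord("inst-001", 12345, 1500, 320, 0.42)]
buffer = io.StringIO()
write_records(records, buffer)
print(buffer.getvalue())
```

In practice the stream would be a file kept alongside the code version that produced it, so that a measure such as backtracks is already on disk when it turns out to be the one that matters.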
10. Exploratory Data Analysis
- Do It All Again … (or at least be able to)
- Do Be Paranoid
- Do Use The Same Problems
  - Reproducibility is a key to science (cf. cold fusion)
  - Being able to do it all again makes it possible
    - e.g. storing random seeds used in experiments
    - We didn’t do that and might have lost an important result
  - Being paranoid allows health-checking
    - e.g. confirm that ‘minor’ code changes do not change results
    - “identical” implementations in C, Scheme, C, gave different results
  - Using the same problems can reduce variance
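
Storing random seeds, as suggested above, might look like the following minimal sketch. The generator here is a hypothetical stand-in (a list of random integers), and the master seed value is arbitrary; the technique is simply to derive each instance from a recorded seed through a private `random.Random` object.

```python
import random

def make_seeds(master_seed, count):
    """Derive a reproducible list of per-instance seeds from one master seed."""
    rng = random.Random(master_seed)
    return [rng.randrange(2**32) for _ in range(count)]

def generate_instance(seed, n_values=10, upper=100):
    """Hypothetical stand-in for a problem generator: a list of random integers."""
    rng = random.Random(seed)  # private generator, so no hidden global state
    return [rng.randrange(upper) for _ in range(n_values)]

seeds = make_seeds(master_seed=1997, count=5)
instances = {seed: generate_instance(seed) for seed in seeds}

# Store the seeds with the results; the same seeds rebuild the same problems later.
reproduced = all(generate_instance(seed) == instances[seed] for seed in seeds)
print(seeds)
print("reproduced:", reproduced)
```

Because every algorithm can be run on instances rebuilt from the same seeds, this also supports the "use the same problems" advice. Identical output is only to be expected with the same generator code and language version, so those are worth recording next to the seeds.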
11. Problems with Benchmarks
- We’ve seen the possible problem of overfitting
  - remember machine learning benchmarks?
- Two common approaches are used
  - benchmark libraries
    - should include hard problems and expand over time
  - random problems
    - should include problems believed to be hard
    - allows unlimited test sets to be constructed
    - disallows “cheating” by hardwiring algorithms
    - so what’s the problem?
12. Problems with Random Problems
- Do Understand Your Problem Generator
  - Constraint satisfaction provides an undying example
  - 40+ papers over 5 years by many authors
    - used random problems from “Models A, B, C, D”
  - All four models were “flawed”
    - Achlioptas et al, 1997
    - asymptotically almost all problems are trivial
    - brings into doubt many experimental results
      - some experiments at typical sizes affected
      - fortunately not many
  - How should we generate problems in future?
13. Flawed and Flawless Problems
- Gent et al (1998) fixed flaw …
  - Introduced “flawless” problem generation
  - defined in two equivalent ways
  - though no proof that problems are truly flawless
- Third year student at Strathclyde found new bug
  - two definitions of flawless not equivalent
- Finally we settled on final definition of flawless
  - and gave proof of asymptotic non-triviality
- So we think we understand the problem generator!
14. Analysis of Data
- Assuming you’ve got everything right so far …
  - there are still lots of mistakes to make
- Do Look at the Raw Data
  - Summaries obscure important aspects of behaviour
  - Many statistical measures explicitly designed to minimise effect of outliers
  - Sometimes outliers are vital
    - “exceptionally hard problems” dominate mean
    - we missed them until they hit us on the head
      - when experiments “crashed” overnight
      - old data on smaller problems showed clear behaviour
15. Analysis of Data
- Do face up to the consequences of your results
  - e.g. preprocessing on 450 problems
    - should “obviously” reduce search
    - reduced search 448 times
    - increased search 2 times
  - Forget algorithm, it’s useless?
  - Or study in detail the two exceptional cases
    - and achieve new understanding of an important algorithm
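
Finding the exceptional cases in a comparison like the one above requires per-instance data rather than an aggregate. A minimal sketch, assuming search cost is stored per problem name; the four problems and their costs below are made up and are not the 450 problems from the slide.

```python
def compare_search(without_pre, with_pre):
    """Classify each instance by the effect preprocessing had on search cost."""
    outcome = {"reduced": [], "increased": [], "unchanged": []}
    for name, before in without_pre.items():
        after = with_pre[name]
        if after < before:
            outcome["reduced"].append(name)
        elif after > before:
            outcome["increased"].append(name)
        else:
            outcome["unchanged"].append(name)
    return outcome

# Made-up search costs (e.g. nodes visited) for four hypothetical problems.
without_pre = {"p1": 120, "p2": 4000, "p3": 75, "p4": 310}
with_pre = {"p1": 90, "p2": 2500, "p3": 9800, "p4": 200}

outcome = compare_search(without_pre, with_pre)
print({kind: len(names) for kind, names in outcome.items()})
print("study in detail:", outcome["increased"])
```

Keeping the instance names, not just the counts, is what makes it possible to go back and examine the few problems where the "obvious" improvement failed.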
16. Presentation of Results
- Do Present Statistics
  - It’s easy to present “average” behaviour
    - We failed to understand mismatch with published data
    - our mean was different to their median!
  - Readers need better understanding of your data
    - e.g. what was standard deviation, best, worst case?
- Do Report Negative Results
  - The experiment that disappoints you …
    - might disappoint lots of others unless you report it!
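
The mismatch between a mean and a median is easy to reproduce. The sketch below uses Python's standard `statistics` module on made-up costs containing one exceptionally hard instance; the `summarise` helper is an illustrative assumption about which figures to report.

```python
import statistics

def summarise(costs):
    """Report several statistics together rather than a single 'average'."""
    return {
        "n": len(costs),
        "mean": statistics.mean(costs),
        "median": statistics.median(costs),
        "stdev": statistics.stdev(costs),  # sample standard deviation
        "best": min(costs),
        "worst": max(costs),
    }

# Made-up search costs: six easy instances and one exceptionally hard one.
costs = [10, 12, 11, 9, 13, 10, 5000]
summary = summarise(costs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the median is 11 while the mean exceeds 700, so two papers reporting "average" cost on the same data could appear to disagree by a large factor. Stating which statistic is used, along with spread and extremes, avoids that confusion.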
17. Summary
- Empirical AI is an exacting science
  - There are many ways to do experiments wrong
- We are experts in doing experiments badly
- As you perform experiments, you’ll make many mistakes
- Learn from those mistakes, and ours!
18. And Finally …
- … the most important advice of all?
- Do Be Stupid
  - (would “refreshingly naïve” sound better?)
  - nature sometimes is less subtle than you think
  - e.g. scaling of behaviour in GSAT
    - it’s linear, stupid
  - e.g. understanding nature of arc consistency in CSPs
    - use a stupid algorithm, stupid