Presentation on theme: "Intelligence Artificial Intelligence Ian Gent Empirical Evaluation of AI Systems."— Presentation transcript:
Intelligence Artificial Intelligence Ian Gent firstname.lastname@example.org Empirical Evaluation of AI Systems
Empirical Evaluation of Computer Systems
- Part I: Philosophy of Science
- Part II: Experiments in AI
- Part III: Basics of Experimental Design, with AI case studies
3 Science as Refutation
- The modern view of the progress of science is based on Popper (Sir Karl Popper, that is)
- A scientific theory is one that can be refuted
- I.e. it should make testable predictions
  - If these predictions are incorrect, the theory is false
  - A refuted theory may still be useful, e.g. Newtonian physics
- Therefore science is hypothesis testing
- Artificial intelligence aspires to be a science
4 Empirical Science
- Empirical = “Relying upon or derived from observation or experiment”
- Most (all) of science is empirical
  - Consider theoretical computer science
    - study based on Turing machines, lambda calculus, etc.
  - Founded on the empirical observation that computer systems developed to date are Turing-complete
  - Quantum computers might challenge this
    - if so, an empirically based theory of quantum computing will develop
5 Theory, not Theorems
- Theory-based science need not be all theorems
  - otherwise science would be mathematics
- Compare the physics theory “QED” (quantum electrodynamics)
  - most accurate theory in the whole of science?
  - based on a model of the behaviour of particles
  - predictions accurate to many decimal places (9?)
  - success derived from accuracy of predictions
    - not the depth or difficulty or beauty of theorems
  - I.e. QED is an empirical theory
- AI/CS has too many theorems and not enough theory
  - compare advice on how to publish in JACM
6 Empirical CS/AI
- Computer programs are formal objects
  - so some people accept only results that can be proved as theorems
  - but theorems are hard
- Treat computer programs as natural objects
  - like quantum particles, chemicals, living objects
    - perform empirical experiments
  - We have a huge advantage over other sciences
    - no need for supercolliders (expensive) or animal experiments (ethical problems)
    - we should have complete command of experiments
7 What are our hypotheses?
- My search program is better than yours
- Search cost grows exponentially with the number of variables for this kind of problem
- Constraint search systems are better at handling overconstrained systems, but OR systems are better at handling underconstrained systems
- My company should buy an AI search system rather than an OR one
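The second hypothesis on this slide is directly testable: if search cost grows exponentially in the number of variables n, then log(cost) is linear in n. A minimal sketch, assuming hypothetical measurements (the `costs` values are synthetic, generated from 3 * 2^n purely for illustration; they are not data from the talk):

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical, synthetic measurements: cost = 3 * 2**n search nodes.
num_vars = [10, 12, 14, 16, 18, 20]
costs = [3 * 2 ** n for n in num_vars]

# Regress log2(cost) on n. A clearly positive, stable slope supports
# exponential growth; 2**slope estimates the cost multiplier per variable.
intercept, slope = fit_line(num_vars, [math.log2(c) for c in costs])
growth_per_variable = 2 ** slope
```

With real measurements the points will not fall exactly on a line, so the residuals of the fit (and a comparison against a polynomial fit on log-log axes) are what refute or support the hypothesis.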
8 Why do experiments?
- Too often AI experimenters might talk like this:
  - What is your experiment for?
  - Is my algorithm better than his?
  - Why?
  - I want to know which is faster
  - Why?
  - Lots of people use each kind …
  - How will these people use your result?
  - ?
9 Why do experiments?
- Compare experiments on identical twins:
  - What is your experiment for?
  - I want to compare twins reared apart to those reared together, and nonidentical twins too.
  - Why?
  - We can get estimates of the genetic and social contributors to performance
  - Why?
  - Because the role of genetics in behavior is one of the great unsolved questions.
- Experiments should address research questions
  - otherwise they can just be “track meets”
10 Basic issues in Experimental Design
- From Paul R. Cohen, Empirical Methods for Artificial Intelligence, MIT Press, 1995, Chapter 3
- Control
- Ceiling and Floor effects
- Sampling Biases
11 Control
- A control is an experiment in which the hypothesised variation does not occur
  - so the hypothesised effect should not occur either
- E.g. macaque monkeys were given a vaccine based on human T-cells infected with SIV (a relative of HIV)
  - the macaques gained immunity from SIV
- Later, macaques were given uninfected human T-cells
  - and the macaques still gained immunity!
- The control experiment was not originally done
  - and the right control is not always obvious (you can’t control for all variables)
12 Case Study: MYCIN
- MYCIN was a medical expert system
  - recommended therapy for blood/meningitis infections
- How to evaluate its recommendations?
- Shortliffe used
  - 10 sample problems
  - 8 other therapy recommenders
    - 5 faculty at Stanford Med. School, 1 senior resident, 1 senior postdoctoral researcher, 1 senior student
  - 8 impartial judges gave 1 point per problem
  - Max score was 80
  - MYCIN: 65, Faculty: 40-60, Fellow: 60, Resident: 45, Student: 30
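The scoring arithmetic on this slide can be made explicit: 8 judges each award up to 1 point on each of 10 problems, giving the maximum of 80. A small sketch that normalises the reported scores to percentages; the variable names are illustrative, and the five faculty scores are kept as a range because only the range is reported:

```python
NUM_JUDGES = 8
NUM_PROBLEMS = 10
MAX_SCORE = NUM_JUDGES * NUM_PROBLEMS  # 1 point per judge per problem

# Scores as reported on the slide.
scores = {"MYCIN": 65, "Fellow": 60, "Resident": 45, "Student": 30}
faculty_range = (40, 60)  # individual faculty scores are not given

# Express each score as a percentage of the maximum.
percent = {name: 100 * s / MAX_SCORE for name, s in scores.items()}
faculty_percent = tuple(100 * s / MAX_SCORE for s in faculty_range)

# Ranking from best to worst among the individually reported scores.
ranking = sorted(scores, key=scores.get, reverse=True)
```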
13 Case Study: MYCIN
- What were the controls?
- Control for judges’ bias for/against computers
  - judges did not know who recommended each therapy
- Control for easy problems
  - medical student did badly, so problems not easy
- Control for our standard being low
  - e.g. random choice should do worse
- Control for factor of interest
  - e.g. hypothesis in MYCIN that “knowledge is power”
  - have groups with different levels of knowledge
14 Ceiling and Floor Effects
- Well-designed experiments can go wrong
- What if all our algorithms do particularly well (or they all do badly)?
- We’ve got little evidence to choose between them
- Ceiling effects arise when test problems are insufficiently challenging
  - floor effects are the opposite, arising when problems are too challenging
- A problem in AI because we often use benchmark sets
- But how do we detect the effect?
15 Ceiling Effects: Machine Learning
- 14 datasets from the UCI corpus of benchmarks
  - used as a mainstay of the ML community
- Problem is learning classification rules
  - each item is a vector of features and a classification
  - measure classification accuracy of method (max 100%)
- Compare C4 with 1R*, two competing algorithms:

  DataSet  BC    CH    GL    G2    HD    HE    ...  Mean
  C4       72    99.2  63.2  74.3  73.6  81.2  ...  85.9
  1R*      72.5  69.2  56.4  77    78    85.1  ...  83.8
16 Ceiling Effects

  DataSet  BC    CH    GL    G2    HD    HE    ...  Mean
  C4       72    99.2  63.2  74.3  73.6  81.2  ...  85.9
  1R*      72.5  69.2  56.4  77    78    85.1  ...  83.8
  Max      72.5  99.2  63.2  77    78    85.1  ...  87.4

  - C4 is only about 2 percentage points better than 1R* on average
  - If we take the best of C4/1R* in each case, we can only achieve 87.4% accuracy
  - We have only weak evidence that C4 is better
  - both methods are performing near the ceiling of what is possible
  - The ceiling effect is that we can’t compare the two methods well because both are achieving near the best practicable
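The Max row and the size of the gap can be checked mechanically. A minimal sketch using only the six per-dataset accuracies shown on the slide (the other eight datasets are elided there, so the means are taken as reported rather than recomputed); variable names are illustrative:

```python
datasets = ["BC", "CH", "GL", "G2", "HD", "HE"]
c4 = [72.0, 99.2, 63.2, 74.3, 73.6, 81.2]
one_r = [72.5, 69.2, 56.4, 77.0, 78.0, 85.1]

# Per-dataset best of the two methods: this reproduces the Max row.
best = [max(a, b) for a, b in zip(c4, one_r)]

# Means over all 14 datasets, as reported on the slide.
mean_c4, mean_1r, mean_best = 85.9, 83.8, 87.4

gap = mean_c4 - mean_1r         # advantage of C4 over 1R*, in points
headroom = mean_best - mean_c4  # what picking the per-dataset winner adds

# On how many of the six visible datasets does C4 actually win?
c4_wins = [d for d, a, b in zip(datasets, c4, one_r) if a > b]
```

On the visible columns C4 wins on only two datasets (CH and GL), and the gap and headroom are both small compared with the spread of accuracies across datasets, which is the signature of a ceiling effect.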
17 Ceiling Effects
- In fact 1R* only uses one feature (the best one)
- C4 uses on average 6.6 features
- 5.6 extra features buy only about 2% improvement
- Conclusion?
  - Either real-world learning problems are easy (use 1R*)
  - Or we need more challenging datasets
  - We need to be aware of ceiling effects in results
18 Sampling Bias
- Sampling bias is when data collection is biased against certain data
  - e.g. a teacher who says “Girls don’t answer maths questions”
  - observation might suggest that …
    - indeed girls don’t answer many questions
    - but also that the teacher doesn’t ask them many questions
  - Experienced AI researchers don’t do that, right?
19 Case Study: Phoenix
- Phoenix = AI system to fight (simulated) forest fires
- Experiments suggested that wind speed was uncorrelated with time to put out a fire
  - obviously incorrect (high winds spread forest fires)
- Wind speed vs containment time (max 150 hours):

  3:  120 55 79 10 140 26 15 110 12 54 10 103
  6:  78 61 58 81 71 57 21 32 70
  9:  62 48 21 55 101

- What’s the problem?
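The relationship on this slide can be checked directly by computing a Pearson correlation over the (wind speed, containment time) pairs. A minimal sketch; note that splitting the run-together digits of the transcript into individual times is an assumption about how the slide reads, and `pearson_r` is a hand-written helper, not a library call:

```python
import math

# Containment times (hours) grouped by wind speed, as read from the slide.
# The split into individual values is an assumption about the garbled transcript.
times_by_wind = {
    3: [120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103],
    6: [78, 61, 58, 81, 71, 57, 21, 32, 70],
    9: [62, 48, 21, 55, 101],
}

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Flatten to one (wind, time) pair per fire.
winds = [w for w, ts in times_by_wind.items() for _ in ts]
times = [t for ts in times_by_wind.values() for t in ts]

r = pearson_r(winds, times)
group_means = {w: sum(ts) / len(ts) for w, ts in times_by_wind.items()}
```

On these values the group means are nearly identical across wind speeds and the coefficient comes out close to zero. Note also that the sample shrinks as wind speed rises (12, 9, then 5 fires), and no time exceeds 150 hours.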
20 Sampling bias in Phoenix
- The cut-off of 150 hours introduces sampling bias
- Many high-wind fires get cut off, but not many low-wind fires
- On the remaining data, there is no correlation between wind speed and time (r = -0.53)
- In fact, the data shows that:
  - a lot of high-wind fires take > 150 hours to contain
  - those that don’t are similar to low-wind fires
- You wouldn’t do this, right? You might if you had automated data analysis.
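The mechanism can be reproduced in a small simulation: generate fires whose containment time really does grow with wind speed, discard every run above the cut-off, and compare correlations before and after. This is a sketch under a purely hypothetical model (exponentially distributed times with mean 25 * wind hours); it is not the Phoenix simulator and the numbers are not from the talk:

```python
import math
import random

CUTOFF = 150.0  # hours, the cut-off used on the slide
RUNS_PER_WIND = 2000

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
fires = []
for wind in (3, 6, 9):
    for _ in range(RUNS_PER_WIND):
        # Hypothetical model: mean containment time is 25 * wind hours.
        fires.append((wind, rng.expovariate(1.0 / (25.0 * wind))))

# Sampling bias: runs that exceed the cut-off never reach the data set.
kept = [(w, t) for w, t in fires if t <= CUTOFF]

r_full = pearson_r(*zip(*fires))  # correlation with every fire included
r_kept = pearson_r(*zip(*kept))   # correlation after the cut-off

dropped_fraction = {
    w: sum(1 for ww, t in fires if ww == w and t > CUTOFF) / RUNS_PER_WIND
    for w in (3, 6, 9)
}
```

In this model the cut-off removes far more high-wind fires than low-wind ones, and the correlation in the surviving data is much weaker than in the full data. Reporting how many runs were dropped per condition, as `dropped_fraction` does, is a cheap guard against this kind of bias.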