2 Evaluation
- How predictive is the model we learned?
- Error on the training data is not a good indicator of performance on future data
- Simple solution that can be used if lots of (labeled) data is available: split the data into a training set and a test set
- However: (labeled) data is usually limited
- More sophisticated techniques need to be used
3 Training and testing
- Natural performance measure for classification problems: error rate
  - Success: instance's class is predicted correctly
  - Error: instance's class is predicted incorrectly
  - Error rate: proportion of errors made over the whole set of instances
- Resubstitution error: error rate obtained from the training data
  - Resubstitution error is (hopelessly) optimistic!
- Test set: independent instances that have played no part in the formation of the classifier
  - Assumption: both training data and test data are representative samples of the underlying problem
4 Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Generally, the larger the training data the better the classifier
- The larger the test data, the more accurate the error estimate
- Holdout procedure: method of splitting the original data into a training set and a test set
- Dilemma: ideally both the training set and the test set should be large!
5 Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate?
  - Depends on the amount of test data
- Prediction is just like tossing a (biased!) coin
  - "Head" is a "success", "tail" is an "error"
- In statistics, a succession of independent events like this is called a Bernoulli process
  - Statistical theory provides us with confidence intervals for the true underlying proportion
6 Confidence intervals
- We can say: p lies within a certain specified interval with a certain specified confidence
- Example: S = 750 successes in N = 1000 trials
  - Estimated success rate: 75%
  - How close is this to the true success rate p?
  - Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
- Another example: S = 75 successes in N = 100 trials
  - With 80% confidence, p ∈ [69.1%, 80.1%]
  - I.e. the probability that p ∈ [69.1%, 80.1%] is 0.8
- The bigger N is, the more confident we are, i.e. the interval around the estimate is smaller
  - Above, for N = 100 we were less confident than for N = 1000
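The intervals quoted above match the Wilson score interval for a proportion. A minimal Python sketch follows; the function name wilson_interval is my own, and z ≈ 1.28 is the standard normal quantile that leaves 10% in each tail, i.e. 80% confidence.

```python
import math

def wilson_interval(successes, n, z):
    """Wilson score interval for a Bernoulli proportion, given a normal quantile z."""
    p_hat = successes / n
    center = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

print(wilson_interval(750, 1000, 1.28))  # ~ (0.732, 0.767)
print(wilson_interval(75, 100, 1.28))    # ~ (0.691, 0.801)
```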
7 Mean and Variance
- Let Y be the random variable with possible values 1 for success and 0 for error
- Let the probability of success be p; then the probability of error is q = 1 - p
- What's the mean? 1*p + 0*q = p
- What's the variance? (1 - p)^2 * p + (0 - p)^2 * q = q^2 * p + p^2 * q = pq(q + p) = pq
8 Estimating p
- We don't know p; our goal is to estimate p
- For this we make N trials, i.e. tests
  - The more trials, the more confident we are
- Let S be the random variable denoting the number of successes, i.e. S is the sum of N value samplings of Y
- Now, we approximate p with the success rate in the N trials, i.e. S/N
- By the Central Limit Theorem, when N is big, the probability distribution of S/N is approximated by a normal distribution with mean p and variance pq/N
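As a quick illustration of this approximation (not part of the original slides), one can simulate many batches of N trials with NumPy and compare the empirical mean and variance of S/N with p and pq/N; the values p = 0.75 and N = 1000 below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, batches = 0.75, 1000, 10000

# Each row holds N Bernoulli(p) samplings of Y; the row mean is one observation of S/N.
success_rates = rng.binomial(1, p, size=(batches, N)).mean(axis=1)

print(success_rates.mean())                    # close to p = 0.75
print(success_rates.var(), p * (1 - p) / N)    # both close to pq/N = 0.0001875
```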
9 Estimating p
- A c% confidence interval [-z, z] for a random variable X with mean 0 is given by: Pr[-z <= X <= z] = c
- With a symmetric distribution: Pr[-z <= X <= z] = 1 - 2 * Pr[X >= z]
- Confidence limits for the normal distribution with mean 0 and variance 1 are tabulated; e.g. Pr[X >= 1.65] = 5%
- Thus: Pr[-1.65 <= X <= 1.65] = 1 - 2 * 5% = 90%
- To use this we have to reduce our random variable S/N to have 0 mean and unit variance
10 Estimating p
- Thus: Pr[-1.65 <= X <= 1.65] = 90%
- To use this we have to reduce our random variable S/N to have 0 mean and unit variance; its standard deviation is sqrt(pq/N), so:
  Pr[-1.65 <= (S/N - p) / sqrt(pq/N) <= 1.65] = 90%
- Now we solve two equations:
  (S/N - p) / sqrt(pq/N) = 1.65
  (S/N - p) / sqrt(pq/N) = -1.65
11 Estimating p
- Let N = 100 and S = 70
- The standard deviation of S/N is sqrt(pq/N); we approximate it by sqrt(p'(1 - p')/N), where p' is the estimate of p, i.e. 0.7
  - So the standard deviation of S/N is approximated by sqrt(0.7 * 0.3 / 100) = 0.046
- The two equations become:
  (0.7 - p) / 0.046 = 1.65, giving p = 0.7 - 1.65 * 0.046 = 0.624
  (0.7 - p) / 0.046 = -1.65, giving p = 0.7 + 1.65 * 0.046 = 0.776
- Thus, we say: with 90% confidence, the success rate p of the classifier satisfies 0.624 <= p <= 0.776
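The same arithmetic in a few lines of Python, purely as a sketch of the calculation above:

```python
import math

N, S, z = 100, 70, 1.65        # z = 1.65 leaves 5% in each tail, i.e. 90% confidence
p_hat = S / N
sigma = math.sqrt(p_hat * (1 - p_hat) / N)    # approximated std. dev. of S/N, ~0.046
print(round(p_hat - z * sigma, 3), round(p_hat + z * sigma, 3))   # 0.624 0.776
```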
12 Holdout estimation
- What to do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training
  - Usually: one third for testing, the rest for training
- Problem: the samples might not be representative
  - Example: a class might be missing in the test data
- Advanced version uses stratification
  - Ensures that each class is represented with approximately equal proportions in both subsets
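A minimal sketch of a stratified holdout split, assuming scikit-learn is available; the iris data is used only as a stand-in dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Reserve one third for testing; stratify=y keeps the class proportions
# approximately equal in the training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
```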
13 Repeated holdout method
- The holdout estimate can be made more reliable by repeating the process with different subsamples
  - In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  - The error rates from the different iterations are averaged to yield an overall error rate
- This is called the repeated holdout method
- Still not optimal: the different test sets overlap
  - Can we prevent overlapping?
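A sketch of repeated (stratified) holdout along these lines, again assuming scikit-learn; the decision tree is just a placeholder learner.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
error_rates = []
for seed in range(10):                          # 10 different random subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    error_rates.append(1 - model.score(X_te, y_te))
print(np.mean(error_rates))                     # averaged holdout error estimate
```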
14 Cross-validation
- Cross-validation avoids overlapping test sets
  - First step: split the data into k subsets of equal size
  - Second step: use each subset in turn for testing, the remainder for training
- Called k-fold cross-validation
- Often the subsets are stratified before the cross-validation is performed
- The error estimates are averaged to yield an overall error estimate
- Standard method for evaluation: stratified 10-fold cross-validation
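A sketch of stratified 10-fold cross-validation with scikit-learn (assumed available; the learner and dataset are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)
print(1 - scores.mean())   # overall cross-validated error estimate
```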
15 Leave-One-Out cross-validation
- Leave-One-Out: a particular form of cross-validation
  - Set the number of folds to the number of training instances
  - I.e., for n training instances, build the classifier n times
- Makes best use of the data
- Involves no random subsampling
- But: computationally expensive
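Leave-One-Out is the same idea with n folds; a sketch using scikit-learn's LeaveOneOut splitter, with the same placeholder learner and dataset as before:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# One fold per instance: n classifiers are built, each tested on a single held-out example.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print(1 - scores.mean())
```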
16 Leave-One-Out-CV and stratification
- Disadvantage of Leave-One-Out-CV: stratification is not possible
  - It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a completely random dataset split equally into two classes
  - The best inducer predicts the majority class
  - 50% accuracy on fresh data
  - Leave-One-Out-CV estimate is 100% error!
17 The bootstrap
- Cross-validation uses sampling without replacement
  - The same instance, once selected, cannot be selected again for a particular training/test set
- The bootstrap uses sampling with replacement to form the training set
  - Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  - Use this data as the training set
  - Use the instances from the original dataset that don't occur in the new training set for testing
- Also called the 0.632 bootstrap (why? see the next slide)
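A sketch of forming one bootstrap training/test split with NumPy, under the assumption that instances are identified by their index position:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

# Training set: n draws with replacement (duplicates are expected).
train_idx = rng.choice(indices, size=n, replace=True)
# Test set: instances that were never drawn.
test_idx = np.setdiff1d(indices, train_idx)

print(len(np.unique(train_idx)) / n)   # ~0.632 of the distinct instances end up in training
print(len(test_idx) / n)               # ~0.368 end up in the test set
```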
18 The 0.632 bootstrap
- Also called the 0.632 bootstrap
  - A particular instance has a probability of 1 - 1/n of not being picked in one draw
  - Thus its probability of ending up in the test data (never being picked in n draws) is:
    (1 - 1/n)^n ≈ e^-1 ≈ 0.368
  - This means the training data will contain approximately 63.2% of the (distinct) instances
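The limit can be checked numerically:

```python
import math

n = 1000
print((1 - 1 / n) ** n)   # 0.3677..., already close to
print(math.exp(-1))       # e^-1 = 0.3679...
```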
19 Estimating error with the bootstrap
- The error estimate on the test data will be very pessimistic
  - The classifier was trained on just ~63% of the instances
- Therefore, combine it with the resubstitution error:
  err = 0.632 * e(test instances) + 0.368 * e(training instances)
  - The resubstitution error gets less weight than the error on the test data
- Repeat the process several times with different replacement samples; average the results
- Probably the best way of estimating performance for very small datasets
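A sketch of the weighting; the two error values below are made up purely for illustration.

```python
def bootstrap_632_error(test_error, resubstitution_error):
    """Combine the (pessimistic) test error and the (optimistic) resubstitution error."""
    return 0.632 * test_error + 0.368 * resubstitution_error

# e.g. 30% error on the held-out instances, 10% resubstitution error
print(bootstrap_632_error(0.30, 0.10))   # roughly 0.2264
```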
20 Comparing data mining schemes
- Frequent question: which of two learning schemes performs better?
- Note: this is domain dependent!
- Obvious way: compare 10-fold cross-validation estimates
- Problem: variance in the estimate
  - Variance can be reduced using repeated cross-validation
  - However, we still don't know whether the results are reliable (need to use a statistical test for that)
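A sketch of the "obvious way": compare the 10-fold cross-validation estimates of two placeholder schemes (scikit-learn assumed). The point estimates alone do not settle which scheme is better, which is why a statistical test is needed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)
knn_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=folds)
# The mean accuracies differ, but the difference may be within the fold-to-fold variance.
print(tree_scores.mean(), knn_scores.mean())
```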
21 Counting the cost
- In practice, different types of classification errors often incur different costs
- Examples:
  - Terrorist profiling: "not a terrorist" is correct 99.99% of the time
  - Loan decisions
  - Fault diagnosis
  - Promotional mailing
22 Counting the cost
- The confusion matrix:

                        Predicted class
                        Yes               No
  Actual class   Yes    True positive     False negative
                 No     False positive    True negative
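A sketch of computing such a matrix with scikit-learn (assumed available); the toy labels are invented purely for illustration.

```python
from sklearn.metrics import confusion_matrix

y_actual    = ["yes", "yes", "no", "no", "yes", "no"]
y_predicted = ["yes", "no",  "no", "yes", "yes", "no"]

# With labels=["yes", "no"], rows are the actual class and columns the predicted class,
# so the cells read [[TP, FN], [FP, TN]].
print(confusion_matrix(y_actual, y_predicted, labels=["yes", "no"]))
```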