

How good is my classifier?

Slide 2: Evaluating Hypotheses (8/29/03)
- We have seen the accuracy metric: classifier performance on a test set

Slide 3
- If we are to trust a classifier's results, we must keep the classifier blindfolded
- Make sure the classifier never sees the test data
- When things seem too good to be true…

Slide 4
- Confusion matrix:

                   Predicted pos    Predicted neg
    Actual pos     true pos         false neg
    Actual neg     false pos        true neg

Slide 5
- Sensitivity: out of the things that actually are positive, how many did we correctly identify (true pos / (true pos + false neg))
- Specificity: out of the things that actually are negative, how many did we correctly identify (true neg / (true neg + false pos))
- The classifier becomes less sensitive as it begins missing what it is trying to detect
- If it identifies more and more things as the target class, it becomes less specific
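These two measures can be sketched directly from confusion-matrix counts. A minimal illustration in Python (the counts below are made-up numbers, not from the slides):

```python
def sensitivity(tp, fn):
    """Fraction of actual positives the classifier caught: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical counts: 90 true pos, 10 false neg, 20 false pos, 80 true neg
tp, fn, fp, tn = 90, 10, 20, 80
print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.8
```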

Slide 6
- Can we quantify our uncertainty?
- Will the accuracy hold with brand new, never-before-seen data?

Slide 7
- The binomial distribution: the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments
- Successes or failures: just what we're looking for!

Slide 8
- Probability that the random variable R will take on a specific value r
- Might be the probability of an error or of a positive
- Since we have been working with accuracy, let's go with positives
- The book works with errors
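The binomial probability P(R = r) described above can be computed directly with the standard library. A sketch (the n, p, and r values are illustrative assumptions):

```python
import math

def binom_pmf(r, n, p):
    """P(R = r) for n independent trials, each succeeding with probability p."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# e.g. probability of exactly 95 correct out of 100 test cases
# when the true accuracy is 0.95 (made-up numbers)
print(binom_pmf(95, 100, 0.95))
```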



Slide 11
- How confident should I be in the accuracy measure?
- If we can live with statements like: "95% of the accuracy measures will fall in the range of 94% and 97%"
- Life is good
- This is a confidence interval


Slide 13
- In R:

    lb = qbinom(.025, n, p)
    ub = qbinom(.975, n, p)

- The lower and upper bounds constitute the confidence interval
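For readers without R, the qbinom calls above can be mirrored with a hand-rolled inverse binomial CDF. A sketch under the assumption of 100 test cases and an observed accuracy of 0.95 (both made-up values):

```python
import math

def binom_cdf(k, n, p):
    """P(R <= k) for a Binomial(n, p) random variable."""
    return sum(math.comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

def qbinom(q, n, p):
    """Smallest k with P(R <= k) >= q; mirrors R's qbinom quantile function."""
    for k in range(n + 1):
        if binom_cdf(k, n, p) >= q:
            return k
    return n

n, p = 100, 0.95  # hypothetical: 100 test cases, observed accuracy 0.95
lb, ub = qbinom(0.025, n, p), qbinom(0.975, n, p)
print(lb, ub)  # the 95% confidence interval, in counts out of n
```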

Slide 14
- What if none of the small cluster of Blues were in the training set?
- All of them would be in the test set; how well would it do?
- Sample error vs. true error
- It might have been an accident, a pathological case

Slide 15
- What if we could test the classifier several times with different test sets?
- If it performed well each time, wouldn't we be more confident in the results?

Slide 16
- Usually we have one big chunk of training data
- If we bust it up into randomly drawn chunks
- We can train on the remainder and test with the held-out chunk

Slide 17
- With 10 chunks, we train 10 times
- Now we have performance data on ten completely different test datasets
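The chunking scheme above can be sketched as an index split. A minimal illustration (the dataset size and fold count here are arbitrary, and no actual classifier is trained):

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled, (near-)equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Each fold serves once as the test set; the remaining folds form the training set.
for fold in kfold_indices(20, 10):
    test = set(fold)
    train = [i for i in range(20) if i not in test]
    # ... train the classifier on `train`, evaluate on `test` ...
```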

Slide 18
- Must stay blindfolded while training
- Must discard all lessons after each fold

Slide 19
- Weka and DataMiner both default to 10-fold
- Could just as easily be 20-fold or 25-fold
- With 20-fold it would be a 95-5 split
- Performance is reported as the average accuracy across the k runs

Slide 20
- If 10-fold satisfies this, we should be in good shape

Slide 21
- Called leave-one-out
- Disadvantage: slow
- Largest possible training set, smallest possible test set
- Has been promoted as an unbiased estimator of error
- Recent studies indicate that there is no unbiased estimator

Slide 22
- Can calculate a confidence interval with a single test set
- More runs (k-fold) give us more confidence that we didn't just get lucky in test-set selection
- Do these runs help narrow the confidence interval?

Slide 23
- The central limit theorem applies
- As the number of runs grows, the distribution of accuracies approaches normal
- With a reasonably large number of runs we can derive a more trustworthy confidence interval
- With 30 test runs (30-fold) we can use traditional approaches to calculating means and standard deviations, and therefore confidence intervals


Slide 25
- In R:

    meanAcc = mean(accuracies)
    sdAcc = sd(accuracies)
    qnorm(.975, meanAcc, sdAcc)   # 0.9980772
    qnorm(.025, meanAcc, sdAcc)   # 0.8169336
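The same mean/sd/qnorm computation can be mirrored with Python's standard library. The accuracies list below is placeholder data, not the values behind the R output above:

```python
import statistics
from statistics import NormalDist

# Hypothetical accuracies from 30 cross-validation runs (placeholder data).
accuracies = [0.95, 0.97, 0.93, 0.96, 0.94, 0.98] * 5

mean_acc = statistics.mean(accuracies)
sd_acc = statistics.stdev(accuracies)

# inv_cdf plays the role of R's qnorm: the 2.5% and 97.5% quantiles
# of a normal with the sample mean and standard deviation.
dist = NormalDist(mean_acc, sd_acc)
print(dist.inv_cdf(0.025), dist.inv_cdf(0.975))
```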

Slide 26
- Can we say that one classifier is significantly better than another?
- Use a t-test
- Null hypothesis: the two sets of results are from the same distribution

Slide 27
- In R:

    > t.test(distOne, distTwo, paired = TRUE)

            Paired t-test

    data:  distOne and distTwo
    t = -55.8756, df = 29, p-value < 2.2e-16
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.2052696 -0.1907732
    sample estimates:
    mean of the differences
             -0.1980214
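The paired t statistic itself is simple to compute by hand: the mean of the per-fold differences divided by its standard error. A sketch with hypothetical per-fold accuracies (not the data behind the R output):

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic: mean of differences over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))

# Hypothetical per-fold accuracies for two classifiers (made-up numbers).
acc_one = [0.91, 0.93, 0.92, 0.94, 0.90]
acc_two = [0.88, 0.89, 0.90, 0.91, 0.87]
t = paired_t(acc_one, acc_two)
print(t)  # compare |t| against the t critical value with df = n - 1
```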

Slide 28
- In Perl:

    use Statistics::TTest;
    my $ttest = new Statistics::TTest;
    $ttest->load_data(\@r1, \@r2);
    $ttest->set_significance(95);
    $ttest->print_t_test();
    print "\n\nt statistic is " . $ttest->t_statistic . "\n";
    print "p val " . $ttest->{t_prob} . "\n";

- Output (abridged):

    t_prob: 0
    significance: 95
    …
    df1: 29
    alpha: 0.025
    t_statistic: 12.8137016607408
    null_hypothesis: rejected
    t statistic is 12.8137016607408
    p val 0

Slide 29
- "The classifier performed exceptionally well, achieving 99.9% classifier accuracy on the 1,000-member training set."
- "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000."
- "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%."
- "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 30-fold cross-validation on a training set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%."

Slide 30
- Randomly permute an array (from the Perl Cookbook)
- http://docstore.mik.ua/orelly/perl/cookbook/ch04_18.htm

    sub fisher_yates_shuffle {
        my $array = shift;
        my $i;
        for ($i = @$array; --$i; ) {
            my $j = int rand ($i + 1);
            next if $i == $j;
            @$array[$i, $j] = @$array[$j, $i];
        }
    }



