Learning from Examples: Standard Methodology for Evaluation - presentation transcript (CS 760 Machine Learning, UW-Madison; © Jude Shavlik 2006, David Page 2010)

1 Learning from Examples: Standard Methodology for Evaluation
1) Start with a dataset of labeled examples
2) Randomly partition it into N groups
3a) N times, combine N-1 groups into a train set
3b) Provide the train set to the learning system
3c) Measure accuracy on the left-out group (the test set)
This is called N-fold cross validation (typically N = 10).
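To make the procedure concrete, here is a minimal Python sketch of N-fold cross validation (the slides contain no code; `learner` is a hypothetical object with fit/predict methods, and `X`, `y` are the labeled examples):

```python
import numpy as np

def n_fold_cv_accuracy(learner, X, y, n_folds=10, seed=0):
    """Estimate accuracy by N-fold cross validation (typically N = 10)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))           # randomly partition the examples
    folds = np.array_split(order, n_folds)    # N roughly equal groups
    accuracies = []
    for i in range(n_folds):
        test_idx = folds[i]                                   # left-out group = test set
        train_idx = np.concatenate(folds[:i] + folds[i + 1:]) # other N-1 groups = train set
        learner.fit(X[train_idx], y[train_idx])
        preds = learner.predict(X[test_idx])
        accuracies.append(np.mean(preds == y[test_idx]))
    return float(np.mean(accuracies))
```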

2 Using Tuning Sets
Often an ML system has to choose when to stop learning, select among alternative answers, etc.
We want the model that produces the highest accuracy on future examples (overfitting avoidance).
It is a cheat to look at the test set while still learning.
Better method:
- Set aside part of the training set.
- Measure performance on this tuning data to estimate future performance for a given set of parameters.
- Use the best parameter settings, then train with all training data (except the test set) to estimate future performance on new examples.

3 Experimental Methodology: A Pictorial Overview
[Diagram: a collection of classified examples is split into training examples (further divided into a train set and a tune set) and testing examples; the LEARNER generates solutions and selects the best, yielding a classifier whose expected accuracy on future examples is reported.]
Statistical techniques such as 10-fold cross validation and t-tests are used to get meaningful results.

4 Proper Experimental Methodology Can Have a Huge Impact!
A 2002 paper in Nature (a major journal) needed to be corrected due to training on the testing set.
Original report: 95% accuracy (5% error rate)
Corrected report (which is still buggy): 73% accuracy (27% error rate)
The error rate increased by over 400%!

5 Parameter Setting
Notice that each train/test fold may get different parameter settings - that's fine (and proper).
I.e., a "parameterless" algorithm internally sets parameters for each data set it gets.

6 Using Multiple Tuning Sets
A single tuning set can be an unreliable predictor, and some data is wasted. Hence, the following is often done:
1) For each possible set of parameter values:
   a) divide the training data into train and tune sets, using N-fold cross validation
   b) score this set of parameter values by its average tune-set accuracy
2) Use the best combination of parameter settings on all (train + tune) examples
3) Apply the resulting model to the test set

7 Tuning a Parameter - Sample Usage
Step 1: Try various values for k (e.g., in kNN), using 10 train/tune splits for each k.
Step 2: Pick the best value for k (e.g., k = 2), then train using all training data.
Step 3: Measure accuracy on the test set.
Example tune-set accuracies (averaged over 10 runs): k = 1: 92%, k = 2: 97%, ..., k = 100: 80%
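A sketch of the tuning loop from slides 6-7, assuming a scikit-learn-style kNN classifier; the candidate k values, the 25% tune-set size, and the 10 splits are illustrative choices, not taken from the slides:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def tune_k(X_train_all, y_train_all, candidate_ks=(1, 2, 5, 10, 100), n_splits=10):
    """Pick k for kNN using repeated train/tune splits of the TRAINING data only."""
    avg_tune_acc = {}
    for k in candidate_ks:    # candidate values are illustrative; large k needs enough data
        accs = []
        for split in range(n_splits):
            X_tr, X_tu, y_tr, y_tu = train_test_split(
                X_train_all, y_train_all, test_size=0.25, random_state=split)
            model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
            accs.append(model.score(X_tu, y_tu))          # tune-set accuracy
        avg_tune_acc[k] = np.mean(accs)
    best_k = max(avg_tune_acc, key=avg_tune_acc.get)
    # Retrain on ALL training data with the chosen k; the test set is never touched here.
    final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_all, y_train_all)
    return best_k, final_model
```

Only after `best_k` is chosen does the test set come into play, exactly once, to estimate future accuracy.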

8 What to Do for the FIELDED System?
Do not use any test sets - instead, only use tuning sets to determine good parameters.
Test sets are used to estimate future performance.
You can report this estimate to your customer, then use all the data to retrain a product to give them.

9 What's Wrong with This?
1. Do a cross-validation study to set parameters
2. Do another cross-validation study, using the best parameters, to estimate future accuracy
How will this relate to the true future accuracy? It is likely to be an overestimate.
What about:
1. Do a proper train/tune/test experiment
2. Improve your algorithm; go to 1
(Machine learning's dirty little secret!)

10 Why Not Learn After Each Test Example?
In production mode this would make sense (assuming one received the correct label).
In experiments, we wish to estimate the probability we'll label the next example correctly, and we need several samples to estimate it accurately.

11 Choosing a Good N for CV (from the Weiss & Kulikowski textbook)
# of examples < 50: instead, use bootstrapping (B. Efron) - see bagging later in CS 760
50 < # of examples < 100: leave-one-out ("jackknife"), i.e., N = size of the data set (leave out one example each time)
# of examples > 100: 10-fold cross validation (CV), also useful for t-tests

12 Recap: N-fold Cross Validation
Can be used to:
1) estimate future accuracy (via test sets)
2) choose parameter settings (via tuning sets)
Method:
1) Randomly permute the examples
2) Divide them into N bins
3) Train on N-1 bins, measure performance on the bin left out
4) Compute the average accuracy on the held-out sets
[Diagram: the examples divided into Fold 1 through Fold 5.]

13 Confusion Matrices - a Useful Way to Report TEST-SET Errors
Useful for the NETtalk testbed - the task of pronouncing written words.

14 Scatter Plots - Compare Two Algorithms on Many Datasets
[Plot: Algorithm A's error rate vs. Algorithm B's error rate; each dot is the error rate of the two algorithms on ONE dataset.]

15 Statistical Analysis of Sampling Effects
Assume we get e errors on N test-set examples.
What can we say about the accuracy of our estimate of the true (future) error rate?
We'll assume test-set and future examples are independently drawn (the iid assumption).
We can then give the probability that our true error rate is in some range - error bars.

16 The Binomial Distribution
A distribution over the number of successes in a fixed number n of independent trials (each with the same probability of success p).
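The slide's formula image did not survive extraction; for reference, the standard binomial probability mass function it describes is

P(X = k) = \binom{n}{k}\, p^{k} (1-p)^{\,n-k}, \qquad k = 0, 1, \dots, n.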

17 Using the Binomial
Let each test case (test data point) be a trial, and let a "success" be an incorrect prediction.
The maximum likelihood estimate of the probability p of success is the fraction of predictions that are wrong.
We can exactly compute the probability that the error rate estimate is off by more than some amount, say 0.025, in either direction.
For large N, this computation is expensive.
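A sketch of that exact computation using scipy's binomial distribution; the 1000 test cases and 10% true error rate in the usage line are illustrative numbers, not from the slide:

```python
import math
from scipy.stats import binom

def prob_estimate_off(n_test, p, delta=0.025):
    """Exact P(|e/N - p| > delta) when the error count e ~ Binomial(N, p)."""
    lo = math.ceil((p - delta) * n_test)    # smallest error count still inside the window
    hi = math.floor((p + delta) * n_test)   # largest error count still inside the window
    prob_inside = binom.cdf(hi, n_test, p) - binom.cdf(lo - 1, n_test, p)
    return 1.0 - prob_inside

# Illustrative numbers: 1000 test cases, true error rate 10%, window +/- 0.025.
print(prob_estimate_off(1000, 0.10))
```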

18 Central Limit Theorem
Roughly: for large enough N, all distributions look Gaussian when summing/averaging N values.
Surprisingly, N = 30 is large enough (in most cases, at least) - see pg. 132 of the textbook.
[Plot: the distribution, on the interval 0 to 1, of the average of Y over N trials, repeated many times.]

19 Confidence Intervals

20 As You Already Learned in Stat 101
If we estimate μ (mean error rate) and σ (std. dev.), we can say our ML algorithm's error rate is μ ± Z_M σ.
Z_M is the value you look up in a table of N(0,1) for the desired confidence; e.g., for 95% confidence it's 1.96.
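A sketch of that interval applied to a test-set error rate, using the usual normal approximation σ ≈ sqrt(p_hat(1 - p_hat)/N); the 27-errors-in-100 example is illustrative:

```python
import math

def error_rate_confidence_interval(errors, n_test, z_m=1.96):
    """Approximate CI for the true error rate: p_hat +/- Z_M * sqrt(p_hat(1-p_hat)/N)."""
    p_hat = errors / n_test
    sigma = math.sqrt(p_hat * (1.0 - p_hat) / n_test)
    return p_hat - z_m * sigma, p_hat + z_m * sigma

# Illustrative: 27 errors on 100 test examples -> roughly (0.18, 0.36) at 95% confidence.
print(error_rate_confidence_interval(27, 100))
```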

21 The Remaining Details

22 Alternative: Bootstrap Confidence Intervals
Given a data set of N items, sample N items uniformly with replacement.
Estimate the value of interest (e.g., train on the bootstrap sample, test on the rest).
Repeat some number of times (1,000 or 10,000 is typical).
95% CI: the values such that the observed statistic is lower (higher) on only 2.5% of runs.
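A minimal percentile-bootstrap sketch of the procedure above; `statistic_fn` is a hypothetical callable that receives the bootstrap sample and the left-out items (e.g., train on the first, evaluate on the second):

```python
import numpy as np

def bootstrap_ci(data, statistic_fn, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample N items with replacement, recompute the statistic."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = []
    for _ in range(n_boot):
        sample_idx = rng.integers(0, n, size=n)            # sample N items with replacement
        out_idx = np.setdiff1d(np.arange(n), sample_idx)   # items not drawn ("the rest")
        stats.append(statistic_fn(data[sample_idx], data[out_idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi    # 95% CI: the middle 95% of the bootstrap replicates
```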

23 Bootstrap
Statisticians typically use this approach more now, given fast computers.
Many applications:
- CIs on estimates of accuracy, area under curve, etc.
- CIs on estimates of mean squared error or absolute error in real-valued prediction
- P-values for one algorithm vs. another according to the above measures

24 Contingency Tables
Counts of occurrences, n(algorithm answer, true answer):
                    True +              True -
Algorithm says +    n(1,1) [true pos]   n(1,0) [false pos]
Algorithm says -    n(0,1) [false neg]  n(0,0) [true neg]

25 TPR and FPR
True Positive Rate (TPR) = n(1,1) / ( n(1,1) + n(0,1) )
  = correctly categorized +'s / total positives
  = P(algorithm outputs + | + is correct)
False Positive Rate (FPR) = n(1,0) / ( n(1,0) + n(0,0) )
  = incorrectly categorized -'s / total negatives
  = P(algorithm outputs + | - is correct)
The False Negative Rate and True Negative Rate can be defined similarly.
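A small sketch computing the contingency counts and the two rates defined above; 0/1 labels are an assumption of the sketch, not of the slides:

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """TPR = n(1,1)/(n(1,1)+n(0,1)); FPR = n(1,0)/(n(1,0)+n(0,0)); counts are n(output, truth)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # n(1,1)
    fp = np.sum((y_pred == 1) & (y_true == 0))   # n(1,0)
    fn = np.sum((y_pred == 0) & (y_true == 1))   # n(0,1)
    tn = np.sum((y_pred == 0) & (y_true == 0))   # n(0,0)
    return tp / (tp + fn), fp / (fp + tn)

print(tpr_fpr([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # (0.666..., 0.5)
```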

26 ROC Curves
ROC: Receiver Operating Characteristic
Started during radar research in WWII.
Judging algorithms on accuracy alone may not be good enough when getting a positive wrong costs more than getting a negative wrong (or vice versa).
E.g., medical tests for serious diseases
E.g., a movie-recommender (a la Netflix) system

27 ROC Curves Graphically
[Plot: true positive rate, P(alg outputs + | + is correct), vs. false positive rate, P(alg outputs + | - is correct), both from 0 to 1.0; the ideal spot is the upper-left corner; curves for Alg 1 and Alg 2 are shown.]
Different algorithms can work better in different parts of ROC space. This depends on the cost of a false + vs. a false -.

28 Creating an ROC Curve - the Standard Approach
You need an ML algorithm that outputs NUMERIC results, such as prob(example is +).
You can use ensembles (covered later) to get this from a model that only provides Boolean outputs.
E.g., have 100 models vote and count the votes.

29 Algorithm for Creating ROC Curves (the most common, but not the only, way)
Step 1: Sort the predictions on the test set
Step 2: Locate a threshold between examples with opposite categories
Step 3: Compute TPR & FPR for each threshold of Step 2
Step 4: Connect the dots
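A sketch of Steps 1-4: sort by the model's numeric score, lower the threshold one example at a time, and record (FPR, TPR) after each step (this traces the same staircase curve as thresholding only between opposite categories). It is an illustrative implementation, not the course's code:

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by thresholding the sorted scores."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    order = np.argsort(-scores)            # Step 1: sort predictions, most positive first
    labels = labels[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for lab in labels:                     # Steps 2-3: lower the threshold one example at a time
        if lab == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points                          # Step 4: connect these dots to draw the curve
```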

30 Plotting ROC Curves - Example
[Table: 10 test examples (Ex 1 ... Ex 10) with the ML algorithm's sorted numeric output and the correct category; the specific values did not survive extraction.]
Sweeping the threshold gives the points:
TPR = 2/5, FPR = 0/5
TPR = 2/5, FPR = 1/5
TPR = 4/5, FPR = 1/5
TPR = 4/5, FPR = 3/5
TPR = 5/5, FPR = 3/5
TPR = 5/5, FPR = 5/5
[Plot: these points in ROC space, P(alg outputs + | + is correct) vs. P(alg outputs + | - is correct).]

31 To Get a Smoother Curve, Linearly Interpolate
[Plot: the ROC points connected by straight line segments; x-axis is P(alg outputs + | - is correct).]

32 Note: each point is a model plus a threshold - call that a "prediction algorithm".
Achievable: to get points along a linear interpolation, flip a weighted coin to choose between prediction algorithms.
Convex hull: perform all interpolations, and discard any point that lies below a line.
[Plot: ROC points and their convex hull; x-axis is P(alg outputs + | - is correct).]

33 Be careful: the prediction algorithms (model and threshold pairs) that look best on the training set may not be the best on future data.
Lessen the risk: perform all interpolations and build the convex hull using a tuning set.
[Plot: ROC points and convex hull; x-axis is P(alg outputs + | - is correct).]

34 ROCs and Many Models (not in the ensemble sense)
It is not necessary to learn one model and then threshold its output to produce an ROC curve.
You could learn different models for different regions of ROC space.
E.g., see Goadrich, Oliphant, & Shavlik, ILP '04 and MLJ '06.

35 Area Under the ROC Curve
A common metric for experiments is to numerically integrate the ROC curve.
[Plot: true positives vs. false positives, with the area under the curve shaded.]
Area under the curve (AUC) - sometimes written AUC-ROC to be explicit.
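A sketch of the numeric integration via the trapezoid rule, over (FPR, TPR) points such as those produced by the `roc_points` sketch earlier:

```python
def auc_roc(points):
    """Area under the ROC curve via the trapezoid rule; points are (FPR, TPR) pairs."""
    pts = sorted(points)                     # integrate left to right along the FPR axis
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between consecutive points
    return area

# A classifier whose curve goes straight to the ideal spot has AUC 1.0:
print(auc_roc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))   # 1.0
```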

36 Asymmetric Error Costs
Assume that cost(FP) != cost(FN).
You would like to pick a threshold that minimizes
E(total cost) = cost(FP) x prob(FP) x (# of -) + cost(FN) x prob(FN) x (# of +)
You could also have (possibly negative) costs for TP and TN (assumed zero above).
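A sketch of picking the threshold that minimizes the expected total cost above, on held-out (tuning) data; the default cost values are illustrative:

```python
import numpy as np

def best_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0):
    """Pick the score threshold minimizing cost(FP)*#FP + cost(FN)*#FN on held-out data."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    candidates = np.append(np.unique(scores), np.inf)   # np.inf = call everything negative
    best_t, best_cost = None, np.inf
    for t in candidates:
        preds = (scores >= t).astype(int)
        n_fp = np.sum((preds == 1) & (labels == 0))
        n_fn = np.sum((preds == 0) & (labels == 1))
        total = cost_fp * n_fp + cost_fn * n_fn   # TP/TN costs assumed zero, as on the slide
        if total < best_cost:
            best_t, best_cost = t, total
    return best_t, best_cost
```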

37 ROCs & Skewed Data
One strength of ROC curves is that they are a good way to deal with skewed data (e.g., far more negatives than positives), since the axes are fractions (rates) independent of the number of examples.
You must be careful, though!
A low FPR x (many negative examples) = a sizable number of FPs - possibly more than the number of TPs.

38 Precision vs. Recall (think about search engines)
Precision = (# of relevant items retrieved) / (total # of items retrieved)
          = n(1,1) / ( n(1,1) + n(1,0) )
          = P(is pos | called pos)
Recall = (# of relevant items retrieved) / (# of relevant items that exist)
       = n(1,1) / ( n(1,1) + n(0,1) ) = TPR
       = P(called pos | is pos)
Notice that n(0,0) is not used in either formula; therefore you get no credit for filtering out irrelevant items.

39 ROC vs. Recall-Precision
You can get very different visual results on the same data.
The reason is that there may be lots of - examples (e.g., you might need to include 100 negatives to get 1 more positive).
[Plots: an ROC curve, P(+ | +) vs. P(+ | -), compared with a precision vs. recall curve for the same data.]

40 Recall-Precision Curves
You cannot simply connect the dots in recall-precision curves as we did for ROC.
See Goadrich, Oliphant, & Shavlik, ILP '04 or MLJ '06.
[Plot: precision vs. recall points illustrating why naive linear interpolation is wrong.]

41 Interpolating in PR Space
We would like to interpolate correctly, then remove points that lie below the interpolation.
This is analogous to the convex hull in ROC space.
Can you do it efficiently? Yes - convert to ROC space, take the convex hull, and convert back to PR space (Davis & Goadrich, ICML-06).

42 The Relationship between Precision-Recall and ROC Curves
Jesse Davis & Mark Goadrich
Department of Computer Sciences, University of Wisconsin

43 Four Questions about PR Space and ROC Space
Q1: Does optimizing AUC in one space optimize it in the other space?
Q2: If a curve dominates in one space, will it dominate in the other?
Q3: What is the best PR curve?
Q4: How do you interpolate in PR space?

44 Optimizing AUC
There is interest in learning algorithms that optimize Area Under the Curve (AUC) [Ferri et al. 2002, Cortes and Mohri 2003, Joachims 2005, Prati and Flach 2005, Yan et al. 2003, Herschtal and Raskutti 2004].
Q: Does an algorithm that optimizes AUC-ROC also optimize AUC-PR?
A: No. One can easily construct a counterexample.

45 Definition: Dominance

46 Definition: Area Under the Curve (AUC)
[Plots: the area under a precision vs. recall curve and under a TPR vs. FPR curve.]

47 How Do We Evaluate ML Algorithms?
Common evaluation metrics:
- ROC curves [Provost et al. '98]
- PR curves [Raghavan '89; Manning & Schutze '99]
- Cost curves [Drummond and Holte '00, '04]
If the class distribution is highly skewed, most believe PR curves are preferable to ROC curves.

48 Two Highly Skewed Domains
Is an abnormality on a mammogram benign or malignant?
Do these two identities refer to the same person?

49 Predicting Aliases [Synthetic data: Davis et al. ICIA 2005]

50 Predicting Aliases [Synthetic data: Davis et al. ICIA 2005]

51 Diagnosing Breast Cancer [Real data: Davis et al. IJCAI 2005]

52 Diagnosing Breast Cancer [Real data: Davis et al. IJCAI 2005]

53 A2: Dominance Theorem
For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in PR space.

54 Q3: What Is the Best PR Curve?
The best curve in ROC space for a set of points is the convex hull [Provost et al. '98].
- It is achievable.
- It maximizes AUC.
Q: Does an analog to the convex hull exist in PR space?
A3: Yes! We call it the Achievable PR Curve.

55 Convex Hull

56 Convex Hull

57 A3: Achievable Curve

58 A3: Achievable Curve

59 Constructing the Achievable Curve
Given: a set of PR points and a fixed number of positive and negative examples
- Translate the PR points to ROC points
- Construct the convex hull in ROC space
- Convert the resulting curve back into PR space
Corollary: by the dominance theorem, the curve in PR space dominates all other legal PR curves you could construct with the given points.

60 Q4: Interpolation
Interpolation in ROC space is easy: a linear connection between points.
[Plot: TPR vs. FPR with points A and B connected by a straight line.]

61 Linear Interpolation Not Achievable in PR Space
Precision interpolation is counterintuitive [Goadrich et al., ILP 2004].
[Tables and plots: example TP/FP counts, the resulting TP rate / FP rate (ROC) points, and the recall/precision (PR) points; the specific values did not survive extraction.]

62 Example Interpolation
Q: For each extra TP covered, how many FPs do you cover?
A: (FP_B - FP_A) / (TP_B - TP_A)
[Table: TP, FP, REC, PREC for points A and B on a dataset with 20 positive and 2000 negative examples; the specific values did not survive extraction.]
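A sketch of the interpolation this implies (following Davis & Goadrich, ICML-06): step TP from TP_A to TP_B one at a time, adding FPs at the local skew. The A and B counts in the usage line are illustrative, since the slide's actual numbers did not survive extraction:

```python
def interpolate_pr(tp_a, fp_a, tp_b, fp_b, n_pos):
    """PR-space interpolation between points A and B: step TP by 1, add FPs at the local skew."""
    skew = (fp_b - fp_a) / (tp_b - tp_a)      # extra FPs incurred per extra TP
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + skew * x
        recall = tp / n_pos
        precision = tp / (tp + fp)
        points.append((recall, precision))
    return points

# Illustrative counts in the spirit of the slide's 20-positive / 2000-negative dataset:
print(interpolate_pr(tp_a=5, fp_a=5, tp_b=10, fp_b=30, n_pos=20))
```

Notice that precision along the interpolated segment is not a straight line, which is exactly why connecting the dots in PR space is wrong.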

63 Example Interpolation
[Table: TP, FP, REC, PREC for points A and B, plus interpolated points, on a dataset with 20 positive and 2000 negative examples; values not preserved in the transcript.]

64 Example Interpolation
[Continuation of the interpolation example; values not preserved in the transcript.]

65 Example Interpolation
[Continuation of the interpolation example; values not preserved in the transcript.]

66 Back to Q2
A2, A3, and A4 relied on A2.
Now let's prove A2...

67 Dominance Theorem
For a fixed number of positive and negative examples, one curve dominates another curve in ROC space if and only if the first curve dominates the second curve in Precision-Recall space.

68 For Fixed N, P, and TPR: FPR <-> Precision (Not =)
[Diagram: a contingency table of algorithm answer vs. true answer, with P positives and N negatives, used in the proof.]

69 Conclusions about PR and ROC Curves
- A curve dominates in one space iff it dominates in the other space.
- An analog to the convex hull exists in PR space, which we call the achievable PR curve.
- Linear interpolation is not achievable in PR space.
- Optimizing AUC in one space does not optimize AUC in the other space.

70 To Avoid Pitfalls, Ask:
1. Is my held-aside test data really representative of going out to collect new data?
Even if your methodology is fine, someone may have collected features for positive examples differently than for negatives - this should be randomized.
Example: samples from cancer patients processed by different people, or on different days, than samples from normal controls.

71 To Avoid Pitfalls, Ask:
2. Did I repeat my entire data-processing procedure on every fold of cross-validation, using only the training data for that fold?
On each fold of cross-validation, did I ever access, in any way, the label of a test case?
Any preprocessing done over the entire data set (feature selection, parameter tuning, threshold selection) must not use the labels.

72 To Avoid Pitfalls, Ask:
3. Have I modified my algorithm so many times, or tried so many approaches, on this same data set that I (the human) am overfitting it?
Have I continually modified my preprocessing or learning algorithm until I got some improvement on this data set?
If so, I really need to get some additional data now, at least to test on.

73 Alg 1 vs. Alg 2
Alg 1 has accuracy 80%, Alg 2 has 82%.
Is this difference significant? That depends on how many test cases these estimates are based on.
The test we do depends on how we arrived at these estimates.

74 Leave-One-Out: Sign Test
Suppose we ran leave-one-out cross-validation on a data set of 100 cases.
Divide the cases into (1) Alg 1 won, (2) Alg 2 won, and (3) ties (both wrong or both right); throw out the ties.
Suppose there are 10 ties and 50 wins for Alg 1.
Ask: under the (null) Binomial(90, 0.5), what is the probability of 50 or more, or 40 or fewer, successes?
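A sketch of that sign-test computation with scipy's binomial distribution; it reproduces the 50-wins / 40-losses example above:

```python
from scipy.stats import binom

def sign_test_p_value(wins_alg1, wins_alg2):
    """Two-sided sign test: P(a result at least this lopsided) under Binomial(n, 0.5)."""
    n = wins_alg1 + wins_alg2            # ties have already been thrown out
    k = max(wins_alg1, wins_alg2)
    # P(X >= k) + P(X <= n - k) under the null hypothesis of a fair coin
    p = binom.sf(k - 1, n, 0.5) + binom.cdf(n - k, n, 0.5)
    return min(1.0, p)

print(sign_test_p_value(50, 40))   # 100 leave-one-out cases, 10 ties thrown out
```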

75 What about 10-Fold?
It is difficult to get significance from a sign test on 10 cases.
We're throwing out the numbers (accuracy estimates) for each fold and just asking which is larger.
Use the numbers: the t-test is designed to test for a difference of means.

76 Paired Student t-Tests
Given:
- 10 training/test sets
- 2 ML algorithms
- the results of the 2 ML algorithms on the 10 test sets
Determine:
- Which algorithm is better on this problem?
- Is the difference statistically significant?

77 Paired Student t-Tests (cont.)
Example accuracies on the test sets:
Algorithm 1: 80%, 50%, 75%, ..., 99%
Algorithm 2: 79%, 49%, 74%, ..., 98%
δ_i:          +1,  +1,  +1, ...,  +1
Algorithm 1's mean is better, but the two standard deviations will clearly overlap.
However, Algorithm 1 is always better than Algorithm 2.

78 The Random Variable in the t-Test
Consider the random variable δ_i = (Algo A's error on test set i) - (Algo B's error on test set i).
Notice we're factoring out test-set difficulty by looking at relative performance.
In general, one tries to explain variance in results across experiments.
Here we're saying that Variance = f(problem difficulty) + g(algorithm strength).

79 More on the Paired t-Test
Our NULL HYPOTHESIS is that the two ML algorithms have equivalent average accuracies,
i.e., the differences (in the scores) are due to random fluctuations about a mean of zero.
We compute the probability that the observed δ arose from the null hypothesis.
If this probability is low, we reject the null hypothesis and say that the two algorithms appear different.
"Low" is usually taken as probability ≤ 0.05.

80 The Null Hypothesis Graphically (View #1)
1. Assume a zero mean and use the sample's variance (sample = experiment).
[Plot: P(δ) vs. δ, with ½(1 - M) probability mass in each tail (i.e., M inside); typically M = 0.95.]
Does our measured δ lie in the tail regions? If so, reject the null hypothesis, since it is unlikely we'd get such a δ by chance.

81 View #2 - The Confidence Interval for δ
2. Use the sample's mean and variance. Is zero inside the M% of probability mass?
If NOT, reject the null hypothesis.
[Plot: P(δ) vs. δ with the M% confidence interval marked.]

82 The t-Test Calculation
Compute:
- the mean
- the sample variance
- the t value looked up for N folds and M confidence level
N-1 is called the degrees of freedom.
As N → ∞, t_{M,N-1} and Z_M become equivalent.
See Table 5.6 in Mitchell.
We don't know an analytical expression for the variance, so we need to estimate it from the data.
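A sketch of the calculation: the per-fold differences δ_i, their mean and sample standard error, and a comparison against the tabled t value (scipy.stats.ttest_rel computes the same statistic directly). The fold accuracies in the usage lines are illustrative, in the spirit of slide 77:

```python
import numpy as np
from scipy import stats

def paired_t_test(acc_alg1, acc_alg2, confidence=0.95):
    """Paired t-test on per-fold accuracies: reject the null if |t| exceeds t_{M, N-1}."""
    deltas = np.asarray(acc_alg1, float) - np.asarray(acc_alg2, float)
    n = len(deltas)
    mean = deltas.mean()
    std_err = deltas.std(ddof=1) / np.sqrt(n)    # sample std. dev. of the mean of the deltas
    t_stat = mean / std_err
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)   # two-tailed, N-1 degrees of freedom
    return t_stat, t_crit, abs(t_stat) > t_crit

# Illustrative 10-fold accuracies: Alg 1 is always 1-2 points better than Alg 2.
alg1 = [80, 50, 75, 88, 64, 72, 91, 83, 57, 99]
alg2 = [79, 49, 73, 87, 62, 71, 90, 81, 56, 97]
print(paired_t_test(alg1, alg2))
```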

83 The t-Test Calculation (cont.) - Using View #2 (you get the same result using View #1)
Calculate the interval mean(δ) ± t_{M,N-1} · s, where s = sqrt( (1 / (N(N-1))) · Σ_i (δ_i - mean(δ))² ).
The interval contains 0 (so we cannot reject the null hypothesis) if |mean(δ)| ≤ t_{M,N-1} · s.
[Plot: the PDF of δ with the interval marked.]

84 Some Jargon: P-values (Uses View #1)
P-value = the probability of getting one's results or greater, given the NULL HYPOTHESIS.
(We usually want P ≤ 0.05 to be confident that a difference is statistically significant.)
[Plot: the null-hypothesis distribution, with the P-value shown as the tail area beyond the observed result.]

85 From Wikipedia (http://en.wikipedia.org/wiki/P-value)
The p-value of an observed value X_observed of some random variable X is the probability that, given that the null hypothesis is true, X will assume a value as or more unfavorable to the null hypothesis as the observed value X_observed.
"More unfavorable to the null hypothesis" can in some cases mean greater than, in some cases less than, and in some cases further away from a specified center.

86 Accepting the Null Hypothesis
Note: even if the p-value is high, we cannot assume the null hypothesis is true.
E.g., if we flip a coin twice and get one head, can we statistically infer the coin is fair?
Vs.: if we flip a coin 100 times and observe 10 heads, we can statistically infer the coin is unfair, because that is very unlikely to happen with a fair coin.
How would we show a coin is fair?

87 More on the t-Distribution
We typically don't have enough folds to assume the central limit theorem (i.e., N < 30), so we need to use the t distribution.
It is wider (and hence shorter) than the Gaussian (Z) distribution (since PDFs integrate to 1), so our confidence intervals will be wider.
Fortunately, t-tables exist.
[Plot: a Gaussian curve and a t_N curve; there is a different t curve for each N.]

88 Some Assumptions Underlying Our Calculations
General: the central limit theorem applies (i.e., >= 30 measurements are averaged).
ML-specific:
- #errors / #tests accurately estimates p, the probability of error on one example; this is used in the formula for σ, which characterizes expected future deviations about the mean (p).
- We use an independent sample of the space of possible instances - representative of future examples, with individual examples drawn iid.
- For paired t-tests, the learned classifier is the same for each fold (stability), since we are combining results across folds.

89 Stability
Stability = how much the model an algorithm learns changes due to minor perturbations of the training set.
The paired t-test's assumptions are a better match to stable algorithms.
Example: k-NN - the higher the k, the more stable.

90 More on the Paired t-Test Assumption
Ideally we would train on one data set and then do a 10-fold paired t-test.
What we should do:  train | test1 … test10
What we usually do: train1 test1 … train10 test10
However, there is usually not enough data to do the ideal.
If we assume that the train data is part of each paired experiment, then we violate the independence assumptions - each train set overlaps 90% with every other train set.
Ideally, the learned model does not vary while we're measuring its performance.

91 Note: Many Statisticians Prefer the Bootstrap Instead
Given a data set of N examples, do the following M times (where M is typically 1K or 10K):
- Sample N examples from the data set randomly, uniformly, with replacement.
- Train both algorithms on the sampled data set and test on the remaining data.
The p-value is the fraction of runs on which Alg A is no better than Alg B.
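A sketch of that bootstrap comparison; `alg_a` and `alg_b` are hypothetical learners with fit/predict methods, and the per-run comparison follows the slide's "fraction of runs on which Alg A is no better than Alg B":

```python
import numpy as np

def bootstrap_compare(alg_a, alg_b, X, y, n_runs=1000, seed=0):
    """P-value = fraction of bootstrap runs on which Alg A is no better than Alg B."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(y)
    not_better = 0
    for _ in range(n_runs):
        boot = rng.integers(0, n, size=n)               # sample N examples with replacement
        held_out = np.setdiff1d(np.arange(n), boot)     # test on the remaining data
        alg_a.fit(X[boot], y[boot])
        alg_b.fit(X[boot], y[boot])
        acc_a = np.mean(alg_a.predict(X[held_out]) == y[held_out])
        acc_b = np.mean(alg_b.predict(X[held_out]) == y[held_out])
        if acc_a <= acc_b:
            not_better += 1
    return not_better / n_runs
```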

92 The Great Debate (or one of them, at least)
Should you use a one-tailed or a two-tailed t-test?
A two-tailed test asks the question: are algorithms A and B statistically different?
A one-tailed test asks the question: is algorithm A statistically better than algorithm B?

93 One- vs. Two-Tailed Graphically
[Plot: P(x) vs. x, showing the rejection regions for a one-tailed test and a two-tailed test (2.5% in each tail).]

94 The Great Debate (More)
Which of these tests should you use when comparing your new algorithm to a state-of-the-art algorithm?
You should use two-tailed, because by using it you are saying "there is a chance I am better and a chance I am worse."
One-tailed is saying "I know my algorithm is no worse," and therefore you are allowed a larger margin of error.
By being more confident, it is easier to show significance!

95 Two-Sided vs. One-Sided
You need to think very carefully about the question you are asking.
Are we within x of the true error rate?
[Plot: a distribution around the measured mean, with the interval from (mean - x) to (mean + x) marked.]

96 Two-Sided vs. One-Sided
How confident are we that ML System A's accuracy is at least 85%?
[Plot: a distribution with the region above 85% marked.]

97 Two-Sided vs. One-Sided
Is ML algorithm A no more accurate than algorithm B?
[Plot: the distribution of A - B.]

98 Two-Sided vs. One-Sided
Are ML algorithms A and B equivalently accurate?
[Plot: the distribution of A - B.]

