Data Science Credibility: Evaluating What’s Been Learned


Data Science Credibility: Evaluating What's Been Learned
Training and Testing; k-fold Cross-Validation; Bootstrap
WFH: Data Mining, Sections 5.1, 5.3, 5.4
Rodney Nielsen
Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Credibility: Evaluating What's Been Learned
- Issues: training, testing, tuning
- Holdout, cross-validation, bootstrap
- Predicting performance
- Comparing schemes: the t-test

Evaluation: The Key to Success
- How predictive is the model we learned?
- Error on the training data is not a good indicator of performance on future data
  - Otherwise 1-NN would be the optimum classifier!
- Simple solution that can be used if lots of (labeled) data is available: split the data into a training set and a test set
- However, (labeled) data is often limited, so more sophisticated techniques need to be used
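As a minimal sketch of the train/test split mentioned above (not part of the slides, which use Weka), the snippet below uses scikit-learn and its bundled iris data purely for illustration. It also shows why training error is misleading: a 1-NN classifier scores perfectly on the data it has memorized.

```python
# Hedged sketch: simple holdout split; scikit-learn and the iris data are
# illustrative assumptions, not taken from the slides.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("Accuracy on training data:", clf.score(X_train, y_train))  # 1.0: 1-NN memorizes
print("Accuracy on held-out test data:", clf.score(X_test, y_test))
```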

Issues in Evaluation
- Statistical reliability of estimated differences in performance (→ significance tests)
- Choice of performance measure:
  - Number of correct classifications
  - Accuracy of probability estimates
  - Error in numeric predictions
- Costs assigned to different types of errors
  - Many practical applications involve costs

Training and Testing I
- Natural performance measure for classification problems: error rate
  - Error: an instance's class is predicted incorrectly
  - Error rate: proportion of errors made over the whole set of instances
- Under what conditions might this not be a very good indicator of performance?
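As a concrete illustration of the error rate definition above (not from the slides; the toy labels are made up), a few lines of NumPy:

```python
import numpy as np

# Toy true labels and predictions, purely illustrative.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

error_rate = np.mean(y_pred != y_true)   # proportion of misclassified instances
print(f"Error rate: {error_rate:.3f}")   # 2 errors out of 8 -> 0.250
```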

Training and Testing II
- Resubstitution error: the error rate obtained from the training data
- Resubstitution error (error on the training data) is (hopelessly) optimistic!
  - Why?

Training and Testing III
- Test set: independent instances that have played no part in the formation of the classifier
- Assumption: both training data and test data are representative samples of the underlying problem: i.i.d. data (independent and identically distributed)
- Test and training data may differ in nature
  - Example: classifiers built using customer data from two different towns, A and B
  - To estimate the performance of the classifier from town A on a completely new town, test it on data from B

Note on Parameter Tuning
- It is important that the test data is not used in any way to create the classifier
- Some learning schemes operate in two stages:
  - Stage 1: build the basic structure
  - Stage 2: optimize parameter settings
- The test data can't be used for any tuning!
- One proper procedure uses three datasets: training data, validation data, and test data
  - The validation data is used to optimize parameters
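A hedged sketch of the three-way split described above, again with scikit-learn and illustrative choices (split ratios, the candidate values of k, and the classifier are assumptions): the model is tuned on the validation data and the test set is touched exactly once, after tuning is finished.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# 60% training, 20% validation, 20% test (illustrative proportions).
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

# Stage 2: optimize a parameter (here k) using the validation data only.
best_k = max((1, 3, 5, 7, 9),
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
                           .fit(X_train, y_train).score(X_val, y_val))

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("Chosen k:", best_k)
print("Test accuracy (used once, after tuning):", final.score(X_test, y_test))
```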

Making the Most of the Data
- Once evaluation is complete, all the data can be used to build the final classifier
- Generally, the larger the training data, the better the classifier (but returns diminish)
- The larger the test data, the more accurate the error estimate
- Holdout procedure: method of splitting the original data into a training set and a test set
  - Dilemma: ideally both the training set and the test set should be large!

Credibility: Evaluating What's Been Learned
- Issues: training, testing, tuning
- Holdout, cross-validation, bootstrap
- Predicting performance
- Comparing schemes: the t-test

Holdout Estimation
- What to do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training
  - Often: one third for testing, the rest for training
- Supposed problem: the samples might not be representative
  - Example: a class might be missing in the test data
- A supposedly "advanced" version uses stratification
  - Ensures that each class is represented with approximately equal proportions in both subsets
  - What might be a negative consequence of this?
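A minimal sketch of a stratified holdout split (scikit-learn and the iris data are illustrative assumptions): passing `stratify=y` keeps the class proportions roughly equal in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

# Each class appears with roughly the same proportion in both subsets.
print("Training class proportions:", np.bincount(y_train) / len(y_train))
print("Test class proportions:    ", np.bincount(y_test) / len(y_test))
```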

Repeated Holdout Method
- The holdout estimate can be made more reliable by repeating the process with different subsamples
  - In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  - The error rates on the different iterations are averaged to yield an overall error rate
- This is called the repeated holdout method, or random resampling
- Still not optimal: the different test sets overlap
  - Can we prevent overlapping?
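A hedged sketch of repeated holdout (random resampling) as described above; the ten seeds, classifier, and dataset are illustrative choices, not from the slides. Each iteration draws a different stratified split and the error rates are averaged.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
error_rates = []
for seed in range(10):                          # ten different random subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
    error_rates.append(1.0 - acc)

print(f"Overall error rate (mean of 10 holdouts): {np.mean(error_rates):.3f}")
```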

Cross-Validation
- Cross-validation avoids overlapping test sets
  - First step: split the data into k subsets of equal size
  - Second step: use each subset in turn for testing and the remainder for training
- This is called k-fold cross-validation
- Often the subsets are stratified before the cross-validation is performed
  - This is the default in Weka, but it isn't necessarily a good idea
- The error estimates are averaged to yield an overall error estimate
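A minimal sketch of stratified k-fold cross-validation; the slides refer to Weka, so using scikit-learn here is an assumption. Each instance lands in a test fold exactly once, so the test sets do not overlap.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # stratified 10 folds
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)    # one accuracy per fold
print(f"Cross-validated error estimate: {1.0 - scores.mean():.3f}")
```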

More on Cross-Validation
- Standard method for evaluation in Weka: stratified ten-fold cross-validation
- Why ten? The authors claim*:
  - Extensive experiments have shown that this is the best choice for an accurate estimate
  - There is also some theoretical evidence for this
- Stratification reduces the estimate's variance
  - But is this appropriate, or does it result in overconfident estimates?
- Repeated stratified cross-validation
  - E.g., repeat 10-fold CV ten times and average the results (reduces the variance, s²)
  - Again, this leads to a false overconfidence
* For a different view, read: Dietterich (1998), Approximate Statistical Tests for Comparing Supervised Learning Algorithms.
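A hedged sketch of repeated stratified cross-validation (10 x 10-fold), assuming scikit-learn's RepeatedStratifiedKFold. Averaging the 100 fold scores reduces the variance of the estimate, though, as the slide cautions, the repetitions reuse the same data and can give a falsely confident picture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)  # 10 x 10-fold
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)            # 100 fold scores
print(f"Mean error: {1.0 - scores.mean():.3f}   (fold-score std: {scores.std():.3f})")
```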

Leave-One-Out Cross-Validation
- Leave-one-out is a particular form of cross-validation:
  - Set the number of folds to the number of training instances
  - I.e., for n training instances, build the classifier n times
- Makes best use of the data
- Involves no random subsampling
- Can be very computationally expensive
  - But there are some exceptions (e.g., NN)
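A minimal sketch of leave-one-out cross-validation, again assuming scikit-learn: the number of folds equals the number of instances, so the classifier is built n times, each time leaving out a single instance for testing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)               # n = 150, so 150 classifiers are built
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(f"Leave-one-out error rate: {1.0 - scores.mean():.3f}")
```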

The Bootstrap
- Cross-validation uses sampling without replacement
  - The same instance, once selected, cannot be selected again for a particular training/test set
- The bootstrap uses sampling with replacement to form the training set:
  - Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  - Use this data as the training set
  - Use the instances from the original dataset that don't occur in the new training set for testing
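A hedged sketch of the bootstrap procedure just described: draw n indices with replacement for the training set and test on the instances that were never drawn. NumPy and scikit-learn are assumptions; the dataset and classifier are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(0)

boot_idx = rng.integers(0, n, size=n)             # n draws with replacement
test_idx = np.setdiff1d(np.arange(n), boot_idx)   # instances never drawn -> test set

clf = KNeighborsClassifier().fit(X[boot_idx], y[boot_idx])
print(f"Test-set size: {len(test_idx)} of {n} instances "
      f"({len(test_idx) / n:.1%}, close to the expected ~36.8%)")
print(f"Bootstrap test error: {1.0 - clf.score(X[test_idx], y[test_idx]):.3f}")
```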

The Bootstrap
[Figure: an original data set of 10 examples is sampled with replacement to form a bootstrap training set of 10 examples; some instances (e.g., x(1), x(10)) appear multiple times, others not at all.]

The Bootstrap
[Figure: the instances from the original data set that do not appear in the bootstrap training set (e.g., x(2)) form the test set.]

The 0.632 Bootstrap
- This procedure is also called the 0.632 bootstrap
- A particular instance has a probability of (n-1)/n of not being picked in a single draw
- Thus its probability of ending up in the test data is: (1 - 1/n)^n ≈ e^(-1) ≈ 0.368
- This means the training data will contain approximately 63.2% of the instances
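A quick numeric check of the 0.632 figure (a small illustrative script, not from the slides): the chance that a given instance is never drawn in n samples with replacement is (1 - 1/n)^n, which approaches e^(-1) ≈ 0.368 as n grows, leaving about 63.2% of the distinct instances in the training set.

```python
import math

# P(an instance is never drawn in n samples with replacement) = (1 - 1/n)^n
for n in (10, 100, 1000, 100000):
    p_test = (1 - 1 / n) ** n
    print(f"n={n:>6}:  P(in test set) = {p_test:.4f}   P(in training set) = {1 - p_test:.4f}")

print(f"Limit as n grows: e^-1 = {math.exp(-1):.4f}  ->  1 - e^-1 = {1 - math.exp(-1):.4f}")
```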