1 Evaluation of Learning Models
Literature: T. Mitchell, Machine Learning, chapter 5; I.H. Witten and E. Frank, Data Mining, chapter 5

2 Fayyad's KDD Methodology (process diagram):
Data → [Selection] → Target data → [Preprocessing & cleaning] → Processed data → [Transformation & feature selection] → Transformed data → [Data Mining] → Patterns → [Interpretation / Evaluation] → Knowledge

3 Overview of the lecture
Evaluating Hypotheses (errors, accuracy)
Comparing Hypotheses
Comparing Learning Algorithms (hold-out methods)
Performance Measures
Varia (Occam's razor, warning)
No Free Lunch

4 Evaluating Hypotheses: Two definitions of error
The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:
$error_D(h) \equiv \Pr_{x \in D}[\, f(x) \neq h(x) \,]$

5 Two definitions of error (2)
The sample error of hypothesis h with respect to target function f and data sample S is the proportion of examples h misclassifies:
$error_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x), h(x))$
where $\delta(f(x), h(x))$ is 1 if $f(x) \neq h(x)$ and 0 otherwise.
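As a minimal illustration, the sample error is just the fraction of disagreements between the true labels and the predictions; the label vectors below are hypothetical:

```python
import numpy as np

# True labels f(x) and hypothesis predictions h(x) for a small sample S
# (hypothetical values, just to illustrate the definition).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Sample error: proportion of examples on which h disagrees with f.
sample_error = np.mean(y_true != y_pred)
print(f"error_S(h) = {sample_error:.3f}")   # 2 misclassified out of 8 -> 0.25
```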

6 Two definitions of error (3)
How well does error_S(h) estimate error_D(h)?

7 Problems estimating error
1. Bias: if S is the training set, error_S(h) is optimistically biased. For an unbiased estimate, h and S must be chosen independently.
2. Variance: even with an unbiased S, error_S(h) may still vary from error_D(h).

8 Example
Hypothesis h misclassifies 12 of the 40 examples in S, so error_S(h) = 12/40 = 0.30.
What is error_D(h)?

9 Estimators
Experiment:
1. Choose a sample S of size n according to distribution D.
2. Measure error_S(h).
error_S(h) is a random variable (i.e., the result of an experiment).
error_S(h) is an unbiased estimator for error_D(h).
Given an observed error_S(h), what can we conclude about error_D(h)?

10 Confidence intervals
If
S contains n examples, drawn independently of h and of each other, and
n ≥ 30,
then with approximately 95% probability, error_D(h) lies in the interval
$error_S(h) \pm 1.96 \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$
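A small sketch that applies this interval to the earlier example (12 misclassifications out of n = 40); the default z = 1.96 is the standard two-sided 95% value:

```python
import math

def error_confidence_interval(misclassified, n, z=1.96):
    """Approximate confidence interval for error_D(h),
    given the number of misclassified examples and the sample size n."""
    e = misclassified / n                     # observed error_S(h)
    half_width = z * math.sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width

# 12 of 40 examples misclassified, 95% confidence (z = 1.96).
low, high = error_confidence_interval(12, 40)
print(f"error_S(h) = {12/40:.2f}, 95% CI for error_D(h): [{low:.3f}, {high:.3f}]")
```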

11 Confidence intervals (2)
If
S contains n examples, drawn independently of h and of each other, and
n ≥ 30,
then with approximately N% probability, error_D(h) lies in the interval
$error_S(h) \pm z_N \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$
where
N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

12 error_S(h) is a random variable
Rerun the experiment with different randomly drawn S (of size n).
Probability of observing r misclassified examples:
$P(r) = \binom{n}{r} \, error_D(h)^{r} \, (1 - error_D(h))^{n-r}$

13 Binomial probability distribution
Probability P(r) of r heads in n coin flips, if p = Pr(heads):
$P(r) = \frac{n!}{r!\,(n-r)!} \, p^{r} (1-p)^{n-r}$
[Figure: binomial distribution for n = 10 and p = 0.3]

14 Binomial probability distribution (2)
Expected, or mean, value of X: $E[X] = np$
Variance of X: $Var(X) = np(1-p)$
Standard deviation of X: $\sigma_X = \sqrt{np(1-p)}$
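These quantities can be checked numerically with scipy (assuming scipy is available); the parameters n = 10, p = 0.3 are taken from the figure caption on the previous slide:

```python
from scipy.stats import binom

n, p = 10, 0.3   # same parameters as the figure on the previous slide

# Probability of observing r "heads" (or misclassified examples) out of n trials.
for r in range(n + 1):
    print(f"P(r={r}) = {binom.pmf(r, n, p):.4f}")

print("mean     :", binom.mean(n, p))   # n*p = 3.0
print("variance :", binom.var(n, p))    # n*p*(1-p) = 2.1
print("std dev  :", binom.std(n, p))    # sqrt(2.1) ~ 1.449
```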

15 Normal distribution approximates binomial
error_S(h) follows a binomial distribution, with
mean $\mu_{error_S(h)} = error_D(h)$
standard deviation $\sigma_{error_S(h)} = \sqrt{\frac{error_D(h)(1 - error_D(h))}{n}}$
Approximate this by a normal distribution with
mean $\mu_{error_S(h)} = error_D(h)$
standard deviation $\sigma_{error_S(h)} \approx \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$

16 Normal probability distribution
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
The probability that X will fall into the interval (a, b) is given by $\int_a^b p(x)\,dx$.
Expected, or mean, value of X: $E[X] = \mu$
Variance of X: $Var(X) = \sigma^2$
Standard deviation of X: $\sigma_X = \sigma$

17 Normal probability distribution (2)
80% of the area (probability) lies in $\mu \pm 1.28\,\sigma$.
N% of the area (probability) lies in $\mu \pm z_N\,\sigma$, where
N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

18 Confidence intervals, more correctly
If
S contains n examples, drawn independently of h and of each other, and
n ≥ 30,
then with approximately 95% probability, error_S(h) lies in the interval
$error_D(h) \pm 1.96 \sqrt{\frac{error_D(h)(1 - error_D(h))}{n}}$
and error_D(h) approximately lies in the interval
$error_S(h) \pm 1.96 \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$

19 Central Limit Theorem
Consider a set of independent, identically distributed random variables Y_1 ... Y_n, all governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean
$\bar{Y} \equiv \frac{1}{n} \sum_{i=1}^{n} Y_i$
Central Limit Theorem: as n → ∞, the distribution governing $\bar{Y}$ approaches a normal distribution with mean μ and variance σ²/n.
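A quick numerical illustration of the theorem; the exponential distribution is chosen arbitrarily here as the "arbitrary" underlying distribution, and the sample size and trial count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 10_000
mu, sigma2 = 1.0, 1.0          # mean and variance of Exponential(1)

# Draw many samples of size n and compute their sample means.
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

print("mean of sample means    :", sample_means.mean())   # ~ mu = 1.0
print("variance of sample means:", sample_means.var())    # ~ sigma2 / n = 0.02
```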

20 Comparing Hypotheses: Difference between hypotheses
Test h_1 on sample S_1, test h_2 on S_2.
1. Pick the parameter to estimate: $d \equiv error_D(h_1) - error_D(h_2)$
2. Choose an estimator: $\hat{d} \equiv error_{S_1}(h_1) - error_{S_2}(h_2)$
3. Determine the probability distribution that governs the estimator:
$\sigma_{\hat{d}} \approx \sqrt{\frac{error_{S_1}(h_1)(1 - error_{S_1}(h_1))}{n_1} + \frac{error_{S_2}(h_2)(1 - error_{S_2}(h_2))}{n_2}}$

21 Difference between hypotheses (2)
4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
$\hat{d} \pm z_N\,\sigma_{\hat{d}}$
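A sketch of steps 1-4 in code; the sample errors and sample sizes used at the bottom are hypothetical numbers, not values from the slides:

```python
import math

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for d = error_D(h1) - error_D(h2), given the
    sample errors e1, e2 measured on independent samples of size n1, n2."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

# Hypothetical numbers: h1 has 30% error on 100 examples, h2 has 20% on 120.
low, high = difference_interval(0.30, 100, 0.20, 120)
print(f"95% CI for d: [{low:.3f}, {high:.3f}]")
```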

22 Paired t test to compare h_A, h_B
1. Partition the data into k disjoint test sets T_1, T_2, ..., T_k of equal size, where this size is at least 30.
2. For i from 1 to k, do: $\delta_i \leftarrow error_{T_i}(h_A) - error_{T_i}(h_B)$
3. Return the value $\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$

23 Paired t test to compare h_A, h_B (2)
N% confidence interval estimate for d:
$\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}}$, where $s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}$
Note that $\bar{\delta}$ is approximately normally distributed.
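A sketch of the same computation in Python; the per-fold error values are hypothetical, and scipy's ttest_rel performs the equivalent paired test directly:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold errors of two hypotheses on the same k = 10 test sets.
err_A = np.array([0.22, 0.25, 0.19, 0.24, 0.21, 0.26, 0.23, 0.20, 0.25, 0.22])
err_B = np.array([0.20, 0.24, 0.18, 0.21, 0.20, 0.24, 0.22, 0.19, 0.23, 0.21])

deltas = err_A - err_B
k = len(deltas)
d_bar = deltas.mean()
s_dbar = np.sqrt(((deltas - d_bar) ** 2).sum() / (k * (k - 1)))

# 95% confidence interval using the t distribution with k-1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=k - 1)
print(f"d_bar = {d_bar:.4f}, "
      f"95% CI = [{d_bar - t_crit * s_dbar:.4f}, {d_bar + t_crit * s_dbar:.4f}]")

# The same comparison via scipy's built-in paired t test.
print(stats.ttest_rel(err_A, err_B))
```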

24 Comparing learning algorithms: L_A and L_B
What we'd like to estimate:
$E_{S \subset D}\big[\, error_D(L_A(S)) - error_D(L_B(S)) \,\big]$
where L(S) is the hypothesis output by learner L using training set S.
I.e., the expected difference in true error between hypotheses output by learners L_A and L_B, when trained using randomly selected training sets S drawn according to distribution D.

25 Comparing learning algorithms L_A and L_B (2)
But, given limited data D_0, what is a good estimator?
We could partition D_0 into training set S_0 and test set T_0, and measure
$error_{T_0}(L_A(S_0)) - error_{T_0}(L_B(S_0))$
Even better: repeat this many times and average the results.

26 Comparing learning algorithms L_A and L_B (3): k-fold cross validation
1. Partition the data D_0 into k disjoint test sets T_1, T_2, ..., T_k of equal size, where this size is at least 30.
2. For i from 1 to k, do: use T_i as the test set and the remaining data as training set S_i, and compute $\delta_i \leftarrow error_{T_i}(L_A(S_i)) - error_{T_i}(L_B(S_i))$.
3. Return the average of the error differences on the test sets, $\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$.
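A sketch of this comparison using scikit-learn; the data set and the two learners (a decision tree and naive Bayes) are placeholders, not choices made in the slides:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
learner_A = DecisionTreeClassifier(random_state=0)
learner_B = GaussianNB()

deltas = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    err_A = 1 - learner_A.fit(X_tr, y_tr).score(X_te, y_te)
    err_B = 1 - learner_B.fit(X_tr, y_tr).score(X_te, y_te)
    deltas.append(err_A - err_B)          # delta_i for this fold

print("mean error difference (A - B):", np.mean(deltas))
```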

27 Practical Aspects: A note on parameter tuning
It is important that the test data is not used in any way to create the classifier.
Some learning schemes operate in two stages:
Stage 1: build the basic structure
Stage 2: optimize parameter settings
The test data cannot be used for parameter tuning!
The proper procedure uses three sets: training data, validation data, and test data. The validation data is used to optimize parameters (a three-way split is sketched below).
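A minimal sketch of such a three-way split with scikit-learn; the data set and the split proportions are arbitrary assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First split off the test set, then split the remainder into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Tune parameters using (X_train, y_train) against (X_val, y_val);
# touch (X_test, y_test) only once, for the final performance estimate.
print(len(X_train), len(X_val), len(X_test))
```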

28 Holdout estimation, stratification
What shall we do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training.
Usually: one third for testing, the rest for training.
Problem: the samples might not be representative. Example: a class might be missing in the test data.
An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.

29 More on cross-validation
The standard method for evaluation is stratified ten-fold cross-validation.
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate; there is also some theoretical evidence for this.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation repeated ten times with the results averaged (this reduces the variance further).
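A sketch of repeated stratified ten-fold cross-validation with scikit-learn; the learner and the data set are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stratified 10-fold cross-validation, repeated 10 times with different splits.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```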

30 Estimation of the accuracy of a learning algorithm
10-fold cross validation gives a pessimistic estimate of the accuracy of the hypothesis built on all training data, provided that the law "the more training data, the better" holds.
For model selection, 10-fold cross validation often works fine.
Another method is leave-one-out, or jackknife (N-fold cross validation with N = training set size).
The standard deviation is also essential for comparing learning algorithms.

31 Performance Measures: Issues in evaluation
Statistical reliability of estimated differences in performance
Choice of performance measure:
 - number of correct classifications
 - accuracy of probability estimates
 - error in numeric predictions
Costs assigned to different types of errors: many practical applications involve costs

32 Counting the costs
In practice, different types of classification errors often incur different costs.
Examples:
 - Predicting when cows are in heat ("in estrus"): "not in estrus" is correct 97% of the time
 - Loan decisions
 - Oil-slick detection
 - Fault diagnosis
 - Promotional mailing

33 Taking costs into account
The confusion matrix:
                 predicted yes    predicted no
 actual yes      True positive    False negative
 actual no       False positive   True negative
There are many other types of costs, e.g. the cost of collecting training data. (A cost-weighted evaluation is sketched below.)
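A sketch of building a confusion matrix and weighting it with a cost matrix; the labels and the cost values are made-up assumptions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])   # hypothetical actual classes
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 0])   # hypothetical predictions

# Rows = actual class, columns = predicted class; labels=[1, 0] puts "yes" first.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)   # [[TP, FN], [FP, TN]]

# Hypothetical cost matrix: a false negative costs 10, a false positive costs 1.
costs = np.array([[0, 10],
                  [1,  0]])
print("total cost:", (cm * costs).sum())
```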

34 Lift charts
In practice, costs are rarely known; decisions are usually made by comparing possible scenarios.
Example: promotional mailout
 - Situation 1: classifier predicts that 0.1% of all households will respond
 - Situation 2: classifier predicts that 0.4% of the most promising households will respond
A lift chart allows for a visual comparison.

35 Generating a lift chart
Instances are sorted according to their predicted probability of being a true positive:
 Rank   Predicted probability   Actual class
 1      0.95                    Yes
 2      0.93                    Yes
 3      0.93                    No
 4      0.88                    Yes
 ...    ...                     ...
In the lift chart, the x axis is the sample size and the y axis is the number of true positives. (See the sketch below.)
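A minimal sketch of computing the points of such a chart from scored instances; the probabilities and labels below are hypothetical:

```python
import numpy as np

# Hypothetical predicted probabilities and actual classes (1 = "Yes").
probs  = np.array([0.95, 0.93, 0.93, 0.88, 0.80, 0.75, 0.60, 0.40])
actual = np.array([1,    1,    0,    1,    1,    0,    0,    1   ])

# Sort by predicted probability, most promising first.
order = np.argsort(-probs)
cum_true_positives = np.cumsum(actual[order])

# One (x, y) point per sample size: x = number of instances selected,
# y = number of true positives among them.
for x, y in enumerate(cum_true_positives, start=1):
    print(f"sample size {x}: true positives {y}")
```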

36 A hypothetical lift chart

37 Probabilities, reliability
In order to generate a lift chart we need a score telling us that one classification is more likely, or more reliably, of the class of interest than another.
The Naïve Bayes classifier and also nearest-neighbor classifiers can output such information.
If we fill our subset starting with the most probable examples, the subset will contain a larger proportion of the desired elements.

38 Summary of measures
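The table on this slide is not reproduced in the transcript. As a substitute, the sketch below computes the measures usually summarized at this point in Witten & Frank, chapter 5 (accuracy, TP rate/recall, FP rate, precision, F-measure) from confusion-matrix counts; the counts at the bottom are hypothetical:

```python
def summary_of_measures(tp, fp, tn, fn):
    """Common measures derived from the confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    tp_rate   = tp / (tp + fn)          # recall / sensitivity
    fp_rate   = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return {"accuracy": accuracy, "TP rate (recall)": tp_rate,
            "FP rate": fp_rate, "precision": precision, "F-measure": f_measure}

# Hypothetical counts.
print(summary_of_measures(tp=40, fp=10, tn=45, fn=5))
```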

39 Varia: Model selection criteria
Model selection criteria attempt to find a good compromise between:
 A. the complexity of a model
 B. its prediction accuracy on the training data
Reasoning: a good model is a simple model that achieves high accuracy on the given data.
Also known as Occam's Razor: the best theory is the smallest one that describes all the facts.

40 Warning
Suppose you are gathering hypotheses that each have a probability of 95% of having an error level below 10%.
What if you have found 100 hypotheses satisfying this condition?
Then the probability that all of them have an error below 10% is (0.95)^100, corresponding to about 0.6%. So the probability of having at least one hypothesis with an error above 10% is about 99.4%!
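The arithmetic behind this warning can be checked directly:

```python
p_all_below = 0.95 ** 100
print(f"P(all 100 hypotheses below 10% error) = {p_all_below:.4f}")      # ~ 0.0059
print(f"P(at least one above 10% error)       = {1 - p_all_below:.4f}")  # ~ 0.9941
```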

41 No Free Lunch!!
Theorem (no free lunch). For any two learning algorithms L_A and L_B the following is true, independently of the sampling distribution and the number of training instances n:
uniformly averaged over all target functions F,
$E(error_S(h_A) \mid F, n) = E(error_S(h_B) \mid F, n)$
The same holds for a fixed training set D.
See: Mitchell, ch. 2.

42 No Free Lunch (2)
Sketch of proof: if all functions are possible, then for each function F on which L_A outperforms L_B, a function F' can be found for which the opposite holds.
Conclusion: without assumptions on the underlying functions (the hypothesis space), no generalization is possible!
In order to generalize, machine learning algorithms need some kind of bias, an inductive bias: either not all possible functions are in the hypothesis space (restriction bias), or not all possible functions will be found because of the search strategy (preference or search bias) (Mitchell, ch. 2, 3).

43 No free lunch (3)
The other way around: for any ML algorithm there exist data sets on which it performs well and data sets on which it performs badly!
We hope that the latter do not occur too often in real life.

44 Summary of notions
True error, sample error
Bias of sample error
Accuracy, confidence intervals
Central limit theorem
Paired t test
k-fold cross validation, leave-one-out, holdout
Stratification
Training set, validation set, and test set
Confusion matrix, TP, FP, TN, FN
Lift chart, ROC curve, recall-precision curve
Occam's razor
No free lunch
Inductive bias; restriction bias; search bias