
Evaluating What’s Been Learned

Cross-Validation
The foundation is a simple idea – the "holdout" – hold out a certain amount of the data for testing and use the rest for training
The split should NOT be made for "convenience"
–It should at least be random
–Better: "stratified" random – the division preserves the relative proportion of classes in both the training and test data
Enhancement: repeated holdout
–Enables more of the data to be used for training, while still getting a good test
10-fold cross-validation has become the standard
–This is improved further if the folds are chosen in a "stratified" random way
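
A rough sketch of stratified fold assignment in plain Python; the labels and the round-robin scheme are illustrative assumptions, not WEKA's implementation:

import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)            # group instance indices by class
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)               # random order within each class
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)       # deal each class out evenly across folds
    return folds                           # so every fold keeps the class proportions

labels = ['yes'] * 9 + ['no'] * 5
for test_idx in stratified_folds(labels, k=2):
    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    print(sorted(test_idx))                # each fold is used once for testing

Each fold serves once as the test set while the remaining folds form the training set, and the k results are averaged.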

For Small Datasets
Leave one out
Bootstrapping
To be discussed in turn

Leave One Out
Train on all but one instance, test on that one (the percent correct on each test is therefore always 100% or 0%)
Repeat until every instance has been tested on, then average the results
Really equivalent to N-fold cross-validation where N = the number of instances available
Plusses:
–Always trains on the maximum possible training data (without cheating)
–Efficient to run – no repeated runs are needed (since fold contents are not randomized)
–No stratification or random sampling is necessary
Minuses:
–Guarantees a non-stratified sample – the correct class will always be at least a little bit under-represented in the training data
–Statistical tests are not appropriate
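
A minimal sketch of the leave-one-out loop, using a made-up toy dataset and a simple 1-nearest-neighbour predictor (both are illustrative assumptions, not the book's example):

data = [(1.0, 'yes'), (1.2, 'yes'), (3.0, 'no'), (3.3, 'no'), (2.1, 'yes')]

def predict_1nn(train, x):
    # predict the label of the closest training instance
    return min(train, key=lambda t: abs(t[0] - x))[1]

correct = 0
for i in range(len(data)):
    x, y = data[i]                          # the single held-out test instance
    train = data[:i] + data[i + 1:]         # train on all the other instances
    correct += predict_1nn(train, x) == y   # each test is either right (1) or wrong (0)

print("LOO accuracy:", correct / len(data))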

Bootstrapping
Sampling is done with replacement to form a training dataset
Particular approach – the 0.632 bootstrap:
–A dataset of n instances is sampled n times
–Some instances will be included multiple times
–Those not picked will be used as test data
–For a large enough dataset, about 63.2% of the distinct instances will end up in the training dataset; the rest will be in the test set
This gives a somewhat pessimistic estimate of performance, since only about 63% of the data is used for training (vs 90% in 10-fold cross-validation)
–May be balanced by also weighting in the performance on the training data (p 129)
This procedure can be repeated any number of times, allowing statistical tests
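
A minimal sketch of one bootstrap split, treating instances as plain indices (the dataset size is an illustrative assumption, and this is not WEKA's implementation):

import random

def bootstrap_split(n, seed=0):
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]      # sample n times with replacement
    picked = set(train)
    test = [i for i in range(n) if i not in picked]   # instances never picked become test data
    return train, test

train, test = bootstrap_split(1000)
print(len(set(train)) / 1000)    # roughly 0.632 of the distinct instances end up in training

# Weighting in the training-data performance gives the usual 0.632 estimate:
#   error = 0.632 * error_on_test_data + 0.368 * error_on_training_data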

Counting the Cost
Some mistakes are more costly to make than others
–Giving a loan to a defaulter is more costly than denying somebody who would be a good customer
–Sending a mail solicitation to somebody who won't buy is less costly than missing somebody who would buy (opportunity cost)
Looking at a confusion matrix, each position could have an associated cost (or benefit, for the correct positions)
The measurement could be average profit/loss per prediction
To be fair, a cost-benefit analysis should also factor in the cost of collecting and preparing the data, building the model …
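
A small sketch of the average-cost-per-prediction idea for the loan example; the confusion-matrix counts and costs below are made up purely for illustration:

# (actual, predicted): count of test instances
confusion = {('good', 'good'): 70, ('good', 'bad'): 10,
             ('bad', 'good'): 5,   ('bad', 'bad'): 15}
# (actual, predicted): cost of that outcome (0 for correct predictions here)
cost = {('good', 'good'): 0, ('good', 'bad'): 1,
        ('bad', 'good'): 5,  ('bad', 'bad'): 0}   # lending to a defaulter is the costly error

total_cost = sum(confusion[k] * cost[k] for k in confusion)
print("average cost per prediction:", total_cost / sum(confusion.values()))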

Lift Charts
In practice, costs are frequently not known
Decisions may be made by comparing possible scenarios
Book example – promotional mailing:
–Situation 1 – previous experience predicts that 0.1% of all (1,000,000) households will respond
–Situation 2 – a classifier predicts that 0.4% of the most promising households will respond
–Situation 3 – a classifier predicts that 0.2% of the most promising households will respond
–The increase in response rate is the lift (0.4 / 0.1 = 4 in situation 2, compared to sending to all)
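
The lift for situation 2 is just the ratio of the targeted response rate to the baseline rate; a one-line check in Python:

baseline_rate = 0.001   # 0.1% of all 1,000,000 households respond
targeted_rate = 0.004   # 0.4% of the most promising households respond (situation 2)
print("lift:", targeted_rate / baseline_rate)   # 4.0 - the targeted group responds 4x as often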

Information Retrieval (IR) Measures
E.g., given a WWW search, a search engine produces a list of hits that are supposedly relevant
Which is better?
–Retrieving 100, of which 40 are actually relevant
–Retrieving 400, of which 80 are actually relevant
It really depends on the costs

Information Retrieval (IR) Measures
The IR community has developed three measures:
–Recall = (number of documents retrieved that are relevant) / (total number of documents that are relevant)
–Precision = (number of documents retrieved that are relevant) / (total number of documents that are retrieved)
–F-measure = (2 × recall × precision) / (recall + precision)
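
The three measures as small functions of raw counts; the document counts in the example call are made up for illustration (they are not the search example above, whose total number of relevant documents is not given):

def recall(relevant_retrieved, total_relevant):
    return relevant_retrieved / total_relevant

def precision(relevant_retrieved, total_retrieved):
    return relevant_retrieved / total_retrieved

def f_measure(r, p):
    return 2 * r * p / (r + p)   # harmonic mean of recall and precision

# e.g. 40 relevant documents among 100 retrieved, with 80 relevant documents in total
r = recall(40, 80)        # 0.5
p = precision(40, 100)    # 0.4
print(f_measure(r, p))    # roughly 0.444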

WEKA
Part of the results provided by WEKA (that we've ignored so far)
Let's look at an example (Naïve Bayes on my-weather-nominal)

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.667    0.125    0.800      0.667   0.727      yes
0.875    0.333    0.778      0.875   0.824      no

=== Confusion Matrix ===
 a b   <-- classified as
 4 2 | a = yes
 1 7 | b = no

TP rate and recall are the same = TP / (TP + FN)
–For Yes = 4 / (4 + 2); For No = 7 / (7 + 1)
FP rate = FP / (FP + TN)
–For Yes = 1 / (1 + 7); For No = 2 / (2 + 4)
Precision = TP / (TP + FP)
–For Yes = 4 / (4 + 1); For No = 7 / (7 + 2)
F-measure = 2TP / (2TP + FP + FN)
–For Yes = 2×4 / (2×4 + 1 + 2) = 8 / 11
–For No = 2×7 / (2×7 + 2 + 1) = 14 / 17

In terms of true positives etc.
True positives = TP; False positives = FP
True negatives = TN; False negatives = FN
Recall = TP / (TP + FN) // true positives / actually positive
Precision = TP / (TP + FP) // true positives / predicted positive
F-measure = 2TP / (2TP + FP + FN)
–This form is derived algebraically from the previous formula
–It is easier to understand this way – the correct predictions are counted twice, once for recall and once for precision, and the denominator includes the corrects plus the incorrects from either viewpoint (relevant but not retrieved, or retrieved but not relevant)
There is no mathematics that says recall and precision must be combined this way – it is ad hoc – but it does balance the two
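
A quick numerical check that the two forms of the F-measure agree, using the yes-class counts from the my-weather-nominal output above:

TP, FP, FN = 4, 1, 2
recall = TP / (TP + FN)          # 4/6
precision = TP / (TP + FP)       # 4/5
print(2 * recall * precision / (recall + precision))   # 0.727..., i.e. 8/11
print(2 * TP / (2 * TP + FP + FN))                     # the same value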

Sensitivity and Specificity
Used by medics to describe diagnostic tests
Sensitivity = the percentage of people WITH the disease who have a positive test = recall = TP / (TP + FN)
Specificity = the percentage of people WITHOUT the disease who have a negative test = 1 – (FP / (FP + TN)) = TN / (FP + TN)
Sometimes the two are balanced by multiplying them: [TP / (TP + FN)] × [TN / (FP + TN)]
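
A minimal sketch of sensitivity, specificity, and their product; the counts are illustrative assumptions, not from any real diagnostic test:

def sensitivity(tp, fn):
    return tp / (tp + fn)    # fraction of people with the disease who test positive (= recall)

def specificity(tn, fp):
    return tn / (fp + tn)    # fraction of people without the disease who test negative

tp, fn, fp, tn = 90, 10, 30, 870
print(sensitivity(tp, fn))                         # 0.9
print(specificity(tn, fp))                         # 0.966...
print(sensitivity(tp, fn) * specificity(tn, fp))   # the combined product measure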

WEKA
For many occasions this borders on "too much information", but it's all there
We can decide: are we more interested in Yes or in No? Are we more interested in recall or in precision?

WEKA – with more than two classes
Contact lenses with Naïve Bayes

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.800    0.053    0.800      0.800   0.800      soft
0.250    0.100    0.333      0.250   0.286      hard
0.800    0.444    0.750      0.800   0.774      none

=== Confusion Matrix ===
  a  b  c   <-- classified as
  4  0  1 | a = soft
  0  1  3 | b = hard
  1  2 12 | c = none

Class exercise – show how to calculate recall, precision, and F-measure for each class

Answers
Recall = TP / (TP + FN)
–Soft = 4 / (4 + (0 + 1)) = .8
–Hard = 1 / (1 + (0 + 3)) = .25
–None = 12 / (12 + (1 + 2)) = .8
Precision = TP / (TP + FP)
–Soft = 4 / (4 + (0 + 1)) = .8
–Hard = 1 / (1 + (0 + 2)) = .33
–None = 12 / (12 + (1 + 3)) = .75
F-measure = 2TP / (2TP + FP + FN)
–Soft = 2×4 / (2×4 + (0 + 1) + (0 + 1)) = 8/10 = .8
–Hard = 2×1 / (2×1 + (0 + 2) + (0 + 3)) = 2/7 = .286
–None = 2×12 / (2×12 + (1 + 3) + (1 + 2)) = 24/31 = .774
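
The same per-class measures can be recomputed directly from the confusion matrix on the previous slide (rows are the actual class, columns the predicted class); a short sketch:

classes = ['soft', 'hard', 'none']
cm = [[4, 0, 1],
      [0, 1, 3],
      [1, 2, 12]]

for i, c in enumerate(classes):
    tp = cm[i][i]
    fn = sum(cm[i]) - tp                   # actual class c, predicted as something else
    fp = sum(row[i] for row in cm) - tp    # predicted as c, actually something else
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f = 2 * tp / (2 * tp + fp + fn)
    print(c, round(recall, 3), round(precision, 3), round(f, 3))
# soft 0.8 0.8 0.8 / hard 0.25 0.333 0.286 / none 0.8 0.75 0.774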

Applying Action Rules to change Detractors to Passives (Accuracy ~ Precision, Coverage ~ Recall)
Assume that we built action rules from the classifiers for Promoter & Detractor
The goal is to change Detractors -> Promoters
The confidence of the action rule = 0.84
Our action rule can target only 4.2 (out of 10.2) detractors
So we can expect 4.2 × 0.84 ≈ 3.5 detractors moving to promoter status