
1 Evaluating What’s Been Learned

2 Cross-Validation
The foundation is a simple idea, the "holdout": hold out a certain amount of the data for testing and use the rest for training.
The separation should NOT be one of convenience:
– It should at least be random
– Better: "stratified" random, so the division preserves the relative proportion of classes in both the training and test data
Enhancement: repeated holdout
– Enables using more of the data for training while still getting a good test
10-fold cross-validation has become the standard; it is improved if the folds are chosen in a stratified random way (a sketch follows below).
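A minimal sketch of stratified 10-fold cross-validation. It uses scikit-learn and a toy dataset as stand-ins, since the slides work in WEKA; the classifier and data here are placeholders, not the course's own.

```python
# Minimal sketch of stratified 10-fold cross-validation (scikit-learn stand-in for WEKA).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)            # placeholder dataset
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # stratified folds
scores = cross_val_score(GaussianNB(), X, y, cv=skf)  # accuracy on each of the 10 folds
print(scores.mean(), scores.std())                    # average and spread over the folds
```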

3 For Small Datasets
– Leave-one-out
– Bootstrapping
To be discussed in turn.

4 Leave-One-Out
Train on all but one instance, test on that one (the percent correct for each test instance is always 100% or 0%).
Repeat until every instance has been tested on, then average the results.
This is really equivalent to N-fold cross-validation where N = the number of instances available.
Plusses:
– Always trains on the maximum possible amount of training data (without cheating)
– Efficient in that no repeated runs are needed, since the fold contents are not randomized
– No stratification or random sampling necessary
Minuses:
– Guarantees a non-stratified sample: the correct class will always be at least a little bit under-represented in the training data
– Statistical tests are not appropriate
A minimal sketch follows below.
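A minimal leave-one-out sketch, again using scikit-learn and a placeholder dataset rather than WEKA:

```python
# Minimal sketch of leave-one-out evaluation (scikit-learn stand-in for WEKA).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)                      # placeholder dataset
scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
print(scores.mean())   # each fold's score is 1.0 or 0.0; the mean is the accuracy estimate
```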

5 Bootstrapping
Sampling is done with replacement to form the training dataset.
Particular approach: the 0.632 bootstrap
– A dataset of n instances is sampled n times, with replacement
– Some instances will be included multiple times
– The instances never picked are used as the test data
– For a large enough dataset, about 0.632 of the instances end up in the training dataset and the rest in the test set (an instance is never picked with probability (1 − 1/n)^n ≈ e^−1 ≈ 0.368)
This is a somewhat pessimistic estimate of performance, since only about 63% of the data is used for training (vs. 90% in 10-fold cross-validation).
This may be balanced by also weighting in the (optimistic) performance on the training data (p. 129).
The procedure can be repeated any number of times, allowing statistical tests.
A sampling sketch follows below.
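A minimal sketch of drawing one 0.632-bootstrap train/test split; the dataset size n is a made-up placeholder:

```python
# Minimal sketch of forming one 0.632-bootstrap train/test split with NumPy.
import numpy as np

rng = np.random.default_rng(1)
n = 1000                                   # pretend dataset of n instances
sample = rng.integers(0, n, size=n)        # sample n indices with replacement
train_idx = np.unique(sample)              # instances picked at least once -> training set
test_idx = np.setdiff1d(np.arange(n), train_idx)   # never-picked instances -> test set
print(len(train_idx) / n)                  # typically close to 0.632
```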

6 Counting the Cost
Some mistakes are more costly to make than others:
– Giving a loan to a defaulter is more costly than denying somebody who would have been a good customer
– Sending a mail solicitation to somebody who won't buy is less costly than missing somebody who would buy (opportunity cost)
Looking at a confusion matrix, each position can have an associated cost (or a benefit, for the correct positions).
The measurement could then be the average profit/loss per prediction (see the sketch below).
To be fair, a cost-benefit analysis should also factor in the cost of collecting and preparing the data, building the model, etc.
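A minimal sketch of the average-cost-per-prediction idea. The confusion matrix is the yes/no one from the WEKA example later in these slides; the cost matrix is entirely hypothetical:

```python
# Minimal sketch of scoring a classifier by average cost per prediction,
# given a confusion matrix and a hypothetical cost matrix (rows = actual, columns = predicted).
import numpy as np

confusion = np.array([[4, 2],     # actual "yes": 4 correct, 2 misclassified
                      [1, 7]])    # actual "no":  1 misclassified, 7 correct
cost = np.array([[0.0, 5.0],      # hypothetical: missing a "yes" costs 5
                 [1.0, 0.0]])     # hypothetical: a false alarm costs 1; correct predictions cost 0
average_cost = (confusion * cost).sum() / confusion.sum()
print(average_cost)               # average cost per prediction
```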

7 Lift Charts
In practice, costs are frequently not known; decisions may instead be made by comparing possible scenarios.
Book example: a promotional mailing
– Situation 1: previous experience predicts that 0.1% of all 1,000,000 households will respond
– Situation 2: a classifier predicts that 0.4% of the 100,000 most promising households will respond
– Situation 3: a classifier predicts that 0.2% of the 400,000 most promising households will respond
The increase in response rate is the lift (0.4 / 0.1 = 4 in situation 2, compared to sending to all). A worked calculation follows below.
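A short worked version of the lift numbers quoted above (all figures come from the slide):

```python
# Worked version of the book's lift example (values taken from the slide).
baseline_rate = 0.001          # 0.1% of all 1,000,000 households respond
targeted_rate = 0.004          # 0.4% of the 100,000 most promising households respond
lift = targeted_rate / baseline_rate
print(lift)                    # 4.0: the classifier's subset responds 4x as often
print(1_000_000 * baseline_rate, 100_000 * targeted_rate)   # 1000 vs. 400 expected responses
```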

8 Information Retrieval (IR) Measures
Example: given a WWW search, a search engine produces a list of hits that are supposedly relevant.
Which is better?
– Retrieving 100 documents, of which 40 are actually relevant
– Retrieving 400 documents, of which 80 are actually relevant
It really depends on the costs.

9 Information Retrieval (IR) Measures
The IR community has developed three measures:
– Recall = (number of relevant documents retrieved) / (total number of relevant documents)
– Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
– F-measure = (2 × recall × precision) / (recall + precision)
A small sketch of these formulas follows below.
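A minimal sketch of the three measures as functions of raw counts. The last line applies them to the slide 8 scenario of retrieving 100 documents with 40 relevant, under the added assumption (not in the slides) that 200 relevant documents exist in total:

```python
# Minimal sketch of the three IR measures as functions of raw counts.
def recall(tp, fn):
    return tp / (tp + fn)                # relevant documents retrieved / all relevant documents

def precision(tp, fp):
    return tp / (tp + fp)                # relevant documents retrieved / all documents retrieved

def f_measure(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)   # algebraically equal to 2*r*p / (r + p)

# Slide 8 scenario: 100 retrieved, 40 relevant; assume 200 relevant documents exist in total.
print(recall(40, 160), precision(40, 60), f_measure(40, 60, 160))
```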

10 WEKA
Part of the results provided by WEKA (that we've ignored so far).
Let's look at an example (Naïve Bayes on my-weather-nominal):

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.667     0.125     0.8         0.667    0.727       yes
0.875     0.333     0.778       0.875    0.824       no

=== Confusion Matrix ===
 a b   <-- classified as
 4 2 | a = yes
 1 7 | b = no

TP rate and recall are the same = TP / (TP + FN)
– For yes = 4 / (4 + 2); for no = 7 / (7 + 1)
FP rate = FP / (FP + TN)
– For yes = 1 / (1 + 7); for no = 2 / (2 + 4)
Precision = TP / (TP + FP)
– For yes = 4 / (4 + 1); for no = 7 / (7 + 2)
F-measure = 2TP / (2TP + FP + FN)
– For yes = 2×4 / (2×4 + 1 + 2) = 8 / 11
– For no = 2×7 / (2×7 + 2 + 1) = 14 / 17

11 In Terms of True Positives, etc.
True positives = TP; false positives = FP; true negatives = TN; false negatives = FN
Recall = TP / (TP + FN) // true positives / actually positive
Precision = TP / (TP + FP) // true positives / predicted positive
F-measure = 2TP / (2TP + FP + FN)
– This form is derived algebraically from the previous formula
– It is easier to understand this way: correct predictions are double-counted, once for recall and once for precision, and the denominator contains the corrects plus the incorrects of either kind (relevant but not retrieved, or retrieved but not relevant)
There is no mathematics that says recall and precision can be combined this way; it is ad hoc, but it does balance the two.

12 Sensitivity and Specificity
Used by medics to describe tests.
Sensitivity = percentage of people WITH the disease who have a positive test = recall = TP / (TP + FN)
Specificity = percentage of people WITHOUT the disease who have a negative test = 1 − (FP / (FP + TN)) = TN / (FP + TN)
Sometimes the two are balanced by multiplying them: [TP / (TP + FN)] × [TN / (FP + TN)] (see the sketch below).
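A small sketch of sensitivity, specificity, and their product, reusing the yes/no confusion matrix from slide 10 and treating "yes" as the disease class:

```python
# Minimal sketch of sensitivity, specificity, and their product,
# using the yes/no confusion matrix from the WEKA example (yes = "disease").
tp, fn, fp, tn = 4, 2, 1, 7

sensitivity = tp / (tp + fn)           # = recall = 0.667
specificity = tn / (fp + tn)           # = 1 - FP rate = 0.875
print(sensitivity, specificity, sensitivity * specificity)
```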

13 WEKA
For many occasions this borders on "too much information", but it's all there.
We can decide: are we more interested in yes or in no? Are we more interested in recall or in precision?

14 WEKA – With More Than Two Classes
Contact lenses with Naïve Bayes:

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.8       0.053     0.8         0.8      0.8         soft
0.25      0.1       0.333       0.25     0.286       hard
0.8       0.444     0.75        0.8      0.774       none

=== Confusion Matrix ===
 a  b  c   <-- classified as
 4  0  1 | a = soft
 0  1  3 | b = hard
 1  2 12 | c = none

Class exercise: show how to calculate recall, precision, and F-measure for each class.

15 Answers
Recall = TP / (TP + FN)
– Soft = 4 / (4 + (0 + 1)) = 0.8
– Hard = 1 / (1 + (0 + 3)) = 0.25
– None = 12 / (12 + (1 + 2)) = 0.8
Precision = TP / (TP + FP)
– Soft = 4 / (4 + (0 + 1)) = 0.8
– Hard = 1 / (1 + (0 + 2)) = 0.33
– None = 12 / (12 + (1 + 3)) = 0.75
F-measure = 2TP / (2TP + FP + FN)
– Soft = 2×4 / (2×4 + (0 + 1) + (0 + 1)) = 8/10 = 0.8
– Hard = 2×1 / (2×1 + (0 + 2) + (0 + 3)) = 2/7 = 0.286
– None = 2×12 / (2×12 + (1 + 3) + (1 + 2)) = 24/31 = 0.774
(A sketch that recomputes these from the confusion matrix follows below.)
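A minimal sketch that recomputes the per-class figures directly from the contact-lenses confusion matrix above (rows are actual classes, columns are predicted classes):

```python
# Recompute per-class recall, precision and F-measure from the contact-lenses
# confusion matrix (rows = actual, columns = predicted).
import numpy as np

classes = ["soft", "hard", "none"]
cm = np.array([[4, 0, 1],
               [0, 1, 3],
               [1, 2, 12]])

for i, name in enumerate(classes):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp        # actual class i, predicted as something else
    fp = cm[:, i].sum() - tp        # other classes predicted as class i
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f = 2 * tp / (2 * tp + fp + fn)
    print(f"{name}: recall={recall:.3f} precision={precision:.3f} f-measure={f:.3f}")
```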

16 Applying Action Rules to Change a Detractor to Passive (accuracy ~ precision, coverage ~ recall)
Assume we built action rules from the classifiers for Promoter and Detractor; the goal is to change Detractors into Promoters.
The confidence of the action rule is 0.993 × 0.849 ≈ 0.84.
Our action rule can target only 4.2 (out of 10.2) detractors, so we can expect about 4.2 × 0.84 ≈ 3.5 detractors moving to promoter status (a worked computation follows below).
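The arithmetic behind that estimate, with the figures taken directly from the slide:

```python
# Worked version of the action-rule estimate (all numbers come from the slide).
rule_confidence = round(0.993 * 0.849, 2)      # confidence of the composed rule -> 0.84
targetable_detractors = 4.2                    # detractors (out of 10.2) the rule can target
expected_conversions = targetable_detractors * rule_confidence
print(rule_confidence, round(expected_conversions, 2))   # 0.84 and ~3.53 expected moves to promoter
```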

