Published by Jaden Batt. Modified about 1 year ago.

1 CSE 881: Data Mining
Lecture 6: Model Overfitting and Classifier Evaluation
Read Sections 4.4 – 4.6

2 Classification Errors
• Training errors (apparent errors)
  – Errors committed on the training set
• Test errors
  – Errors committed on the test set
• Generalization errors
  – Expected error of a model over a random selection of records from the same distribution

3 Example Data Set
Two-class problem: +, o
3000 data points (30% for training, 70% for testing)
Data for the + class is generated from a uniform distribution
Data for the o class is generated from a mixture of 3 Gaussian distributions, centered at (5,15), (10,5), and (15,15)
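
A data set of this shape could be generated roughly as follows. This is only a sketch: the slide does not give the range of the uniform distribution, the class proportions, or the Gaussian variances, so the [0, 20] × [0, 20] square, the even class split, and the unit standard deviation below are assumptions, and `make_dataset` is a hypothetical helper name.

```python
import random

def make_dataset(n_points=3000, train_fraction=0.3, seed=0):
    """Return (train, test) lists of (x, y, label) records."""
    rng = random.Random(seed)
    centers = [(5, 15), (10, 5), (15, 15)]  # Gaussian centers from the slide
    records = []
    for i in range(n_points):
        if i % 2 == 0:
            # '+' class: uniform over an ASSUMED [0, 20] x [0, 20] square
            records.append((rng.uniform(0, 20), rng.uniform(0, 20), "+"))
        else:
            # 'o' class: pick one mixture component, ASSUMED std dev 1.0
            cx, cy = rng.choice(centers)
            records.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0), "o"))
    rng.shuffle(records)
    n_train = round(n_points * train_fraction)
    return records[:n_train], records[n_train:]

train, test = make_dataset()
print(len(train), len(test))
```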

4 Decision Trees
[Figures: decision tree with 11 leaf nodes; decision tree with 24 leaf nodes]
Which tree is better?

5 Model Overfitting
Underfitting: when the model is too simple, both training and test errors are large
Overfitting: when the model is too complex, training error is small but test error is large

6 Mammal Classification Problem
[Figures: training set; decision tree model]
Training error = 0%

7 Effect of Noise
Example: Mammal Classification problem
[Tables: training set; test set]
Model M1: train err = 0%, test err = 30%
Model M2: train err = 20%, test err = 10%

8 Lack of Representative Samples
Too few training records at the leaf nodes to make reliable classifications
[Tables: training set; test set]
Model M3: train err = 0%, test err = 30%

9 Effect of Multiple Comparison Procedure
• Consider the task of predicting whether the stock market will rise or fall on each of the next 10 trading days
• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:
  Day 1   Up
  Day 2   Down
  Day 3   Down
  Day 4   Up
  Day 5   Down
  Day 6   Down
  Day 7   Up
  Day 8   Up
  Day 9   Up
  Day 10  Down

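10 Effect of Multiple Comparison Procedure
• Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst who makes the most correct predictions
• Probability that at least one analyst makes at least 8 correct predictions:
  $P(\#\text{correct} \ge 8) = \frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547$ for a single analyst
  $1 - (1 - 0.0547)^{50} = 0.9399$ for at least one of the 50 analysts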
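
The probability in question can be checked numerically. The sketch below assumes the analysts guess independently with P(correct) = 0.5, as in the slide; `prob_at_least` is a hypothetical helper name.

```python
from math import comb

def prob_at_least(k_min, n_trials=10, p=0.5):
    """P(at least k_min successes in n_trials independent Bernoulli(p) trials)."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(k_min, n_trials + 1)
    )

# One analyst gets at least 8 of 10 random guesses right
p_single = prob_at_least(8)

# At least one of 50 independent analysts does so
p_any = 1 - (1 - p_single) ** 50

print(p_single, p_any)
```

Although a single random guesser rarely reaches 8 correct predictions, picking the best of 50 guessers makes such a score very likely, which is the multiple comparison effect.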

11 Effect of Multiple Comparison Procedure
• Many algorithms employ the following greedy strategy:
  – Initial model: M
  – Alternative model: $M' = M \cup \gamma$, where $\gamma$ is a component to be added to the model (e.g., a test condition of a decision tree)
  – Keep M' if improvement, $\Delta(M, M') > \alpha$
• Often, $\gamma$ is chosen from a set of alternative components, $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_k\}$
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting

12 Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways of estimating generalization errors

13 Estimating Generalization Errors
• Resubstitution Estimate
• Incorporating Model Complexity
• Estimating Statistical Bounds
• Use Validation Set

14 Resubstitution Estimate
• Using training error as an optimistic estimate of generalization error
[Figure: decision trees T_L and T_R]
e(T_L) = 4/24
e(T_R) = 6/24

15 Incorporating Model Complexity
• Rationale: Occam's Razor
  – Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
  – A complex model has a greater chance of being fitted accidentally by errors in data
  – Therefore, one should include model complexity when evaluating a model

16 Pessimistic Estimate
Given a decision tree node t
  – n(t): number of training records classified by t
  – e(t): misclassification error of node t
  – Training error of tree T, summing over its leaf nodes $t_i$:
    $e'(T) = \frac{\sum_i [e(t_i) + \Omega(t_i)]}{\sum_i n(t_i)} = \frac{e(T) + \Omega(T)}{N}$
    $\Omega$: the cost of adding a node
    N: total number of training records

17 Pessimistic Estimate
e(T_L) = 4/24
e(T_R) = 6/24
$\Omega$ = 1
e'(T_L) = (4 + 7 × 1)/24 = 0.458
e'(T_R) = (6 + 4 × 1)/24 = 0.417
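
The arithmetic on this slide can be reproduced with a few lines. The sketch assumes, as the numbers imply, that T_L has 7 leaf nodes, T_R has 4, and every leaf carries the same penalty; `pessimistic_error` is a hypothetical helper name.

```python
def pessimistic_error(n_errors, n_leaves, n_records, omega=1.0):
    """Training errors plus a penalty of omega per leaf, divided by N."""
    return (n_errors + omega * n_leaves) / n_records

e_left = pessimistic_error(n_errors=4, n_leaves=7, n_records=24)
e_right = pessimistic_error(n_errors=6, n_leaves=4, n_records=24)

print(round(e_left, 3))   # 0.458
print(round(e_right, 3))  # 0.417
```

T_L has the lower training error, but once each leaf is charged a cost of 1 the smaller tree T_R has the lower pessimistic estimate.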

18 Minimum Description Length (MDL)
• Cost(Model, Data) = Cost(Data|Model) + Cost(Model)
  – Cost is the number of bits needed for encoding.
  – Search for the least costly model.
• Cost(Data|Model) encodes the misclassification errors.
• Cost(Model) uses node encoding (number of children) plus splitting condition encoding.

19 Estimating Statistical Bounds
Before splitting:
  e = 2/7, e'(7, 2/7, 0.25) = 0.503
  e'(T) = 7 × 0.503 = 3.521
After splitting:
  e(T_L) = 1/4, e'(4, 1/4, 0.25) = 0.537
  e(T_R) = 1/3, e'(3, 1/3, 0.25) = 0.650
  e'(T) = 4 × 0.537 + 3 × 0.650 = 4.098
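
The transcript does not preserve the formula for e'(N, e, α). One bound that reproduces the numbers on this slide is the upper limit of the normal-approximation interval for a binomial proportion, with z taken as $z_{\alpha/2}$:

$e'(N, e, \alpha) = \dfrac{e + \frac{z^2}{2N} + z\sqrt{\frac{e(1-e)}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$

The sketch below assumes that form; `error_upper_bound` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def error_upper_bound(n, e, alpha):
    """Upper bound on the error rate of a node with n records and error rate e."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.15 for alpha = 0.25
    numerator = e + z * z / (2 * n) + z * sqrt(e * (1 - e) / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)

# Expected number of errors before and after the split
before = 7 * error_upper_bound(7, 2 / 7, 0.25)
after = 4 * error_upper_bound(4, 1 / 4, 0.25) + 3 * error_upper_bound(3, 1 / 3, 0.25)

print(round(before, 2), round(after, 2))
```

Because the bound after splitting is larger than before, this estimate argues against making the split.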

20 Using Validation Set
• Divide training data into two parts:
  – Training set: use for model building
  – Validation set: use for estimating generalization error
  Note: the validation set is not the same as the test set
• Drawback:
  – Less data available for training
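
A minimal sketch of such a split is shown below. The slide does not say how much data to hold out, so the one-third validation fraction is an assumption, and `split_train_validation` is a hypothetical helper name.

```python
import random

def split_train_validation(records, validation_fraction=1 / 3, seed=0):
    """Shuffle the training data and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    # First part is held out for validation, the rest is used to build the model
    return shuffled[n_validation:], shuffled[:n_validation]

train_part, validation_part = split_train_validation(range(30))
print(len(train_part), len(validation_part))  # 20 10
```

The test set stays untouched throughout; only the original training data is divided.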

21 Handling Overfitting in Decision Tree
• Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node:
      Stop if all instances belong to the same class
      Stop if all the attribute values are the same
  – More restrictive conditions:
      Stop if the number of instances is less than some user-specified threshold
      Stop if the class distribution of instances is independent of the available features (e.g., using a $\chi^2$ test)
      Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
      Stop if estimated generalization error falls below a certain threshold

22 Handling Overfitting in Decision Tree
• Post-pruning
  – Grow decision tree to its entirety
  – Subtree replacement
      Trim the nodes of the decision tree in a bottom-up fashion
      If generalization error improves after trimming, replace sub-tree by a leaf node
      Class label of leaf node is determined from majority class of instances in the sub-tree
  – Subtree raising
      Replace subtree with most frequently used branch

23 Example of Post-Pruning
Node before splitting: Class = Yes: 20, Class = No: 10, Error = 10/30
Training error (before splitting) = 10/30
Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
PRUNE!
Child nodes after splitting:
  Child   Class = Yes   Class = No
  1       8             4
  2       3             4
  3       4             1
  4       5             1
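
The pruning decision above can be recomputed from the class counts. This is a sketch assuming each leaf predicts its majority class and carries the penalty of 0.5 used on the slide; `leaf_errors` is a hypothetical helper name.

```python
def leaf_errors(class_counts):
    """Records misclassified when the leaf predicts its majority class."""
    return sum(class_counts) - max(class_counts)

PENALTY = 0.5                      # cost per leaf, as on the slide
parent = (20, 10)                  # (Yes, No) before splitting
children = [(8, 4), (3, 4), (4, 1), (5, 1)]

# Pessimistic error counts (numerators over N = 30)
before = leaf_errors(parent) + PENALTY
after = sum(leaf_errors(c) for c in children) + PENALTY * len(children)

prune = after >= before            # splitting does not help, so prune
print(before, after, prune)        # 10.5 11.0 True
```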

24 Examples of Post-pruning

25 Evaluating Performance of Classifier
• Model Selection
  – Performed during model building
  – Purpose is to ensure that the model is not overly complex (to avoid overfitting)
  – Need to estimate generalization error
• Model Evaluation
  – Performed after the model has been constructed
  – Purpose is to estimate performance of the classifier on previously unseen data (e.g., test set)

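26 Methods for Classifier Evaluation
• Holdout
  – Reserve k% for training and (100-k)% for testing
• Random subsampling
  – Repeated holdout
• Cross validation
  – Partition data into k disjoint subsets
  – k-fold: train on k-1 partitions, test on the remaining one
  – Leave-one-out: k = n
• Bootstrap
  – Sampling with replacement
  – .632 bootstrap: $acc_{boot} = \frac{1}{b}\sum_{i=1}^{b}\left(0.632 \times \epsilon_i + 0.368 \times acc_s\right)$, where $\epsilon_i$ is the accuracy on the records left out of bootstrap sample i and $acc_s$ is the accuracy on the full labeled data set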
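
The k-fold partitioning can be sketched as index bookkeeping. The helper below is hypothetical and assigns records to folds in round-robin order; in practice the records would normally be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k disjoint folds over range(n)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Train on the other k-1 partitions
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(n=10, k=5):
    print(test_idx, train_idx)
```

Calling `k_fold_indices(n, n)` gives leave-one-out, since each test fold then holds a single record.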

27 Methods for Comparing Classifiers
• Given two models:
  – Model M1: accuracy = 85%, tested on 30 instances
  – Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
  – How much confidence can we place on the accuracy of M1 and M2?
  – Can the difference in performance be explained as a result of random fluctuations in the test set?

28 Confidence Interval for Accuracy
• Prediction can be regarded as a Bernoulli trial
  – A Bernoulli trial has 2 possible outcomes
      Coin toss: head/tail
      Prediction: correct/wrong
  – A collection of Bernoulli trials has a Binomial distribution:
      x ~ Bin(N, p), x: number of correct predictions
• Estimate number of events
  – Given N and p, find P(x = k) or E(x)
  – Example: Toss a fair coin 50 times; how many heads would turn up?
      Expected number of heads = N × p = 50 × 0.5 = 25
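
Both quantities mentioned here, P(x = k) and E(x), follow directly from the Binomial distribution. A small sketch with hypothetical helper names:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(x = k) for x ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_mean(n, p):
    """E(x) for x ~ Bin(n, p)."""
    return n * p

print(binomial_mean(50, 0.5))      # 25.0
print(binomial_pmf(25, 50, 0.5))   # probability of exactly 25 heads
```

Even though 25 is the expected number of heads, the chance of seeing exactly 25 is only a little over 11%.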

29 Confidence Interval for Accuracy
• Estimate parameter of distribution
  – Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances),
  – Find upper and lower bounds of p (true accuracy of model)

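30 Confidence Interval for Accuracy
• For large test sets (N > 30),
  – acc has a normal distribution with mean p and variance p(1-p)/N
    $P\left(Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$
• Confidence Interval for p:
    $p = \frac{2 N \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4 N \cdot acc - 4 N \cdot acc^2}}{2\,(N + Z_{\alpha/2}^2)}$
[Figure: standard normal density with area $1 - \alpha$ between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$]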

31 Confidence Interval for Accuracy
• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  – N = 100, acc = 0.8
  – Let $1 - \alpha$ = 0.95 (95% confidence)
  – From the probability table, $Z_{\alpha/2}$ = 1.96

  1 - α   Z
  0.99    2.58
  0.98    2.33
  0.95    1.96
  0.90    1.65

  N          50      100     500     1000    5000
  p(lower)   0.670   0.711   0.763   0.774   0.789
  p(upper)   0.888   0.866   0.833   0.824   0.811
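
The bounds in the table can be recomputed by solving the normal approximation for p, which gives $p = \frac{2N \cdot acc + Z^2 \pm Z\sqrt{Z^2 + 4N \cdot acc - 4N \cdot acc^2}}{2(N + Z^2)}$ with $Z = Z_{\alpha/2}$. The sketch below assumes that form; `accuracy_interval` is a hypothetical helper name, and the last digit can differ slightly from the table because Z is not rounded to 1.96.

```python
from math import sqrt
from statistics import NormalDist

def accuracy_interval(acc, n, confidence=0.95):
    """Lower and upper bounds on the true accuracy p."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denominator = 2 * (n + z * z)
    return (center - spread) / denominator, (center + spread) / denominator

for n in (50, 100, 500, 1000, 5000):
    lower, upper = accuracy_interval(0.8, n)
    print(n, round(lower, 3), round(upper, 3))
```

The interval narrows around 0.8 as N grows, matching the trend in the table.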

32 Comparing Performance of 2 Models
• Given two models, say M1 and M2, which is better?
  – M1 is tested on D1 (size = n1), found error rate = e1
  – M2 is tested on D2 (size = n2), found error rate = e2
  – Assume D1 and D2 are independent
  – If n1 and n2 are sufficiently large, then
      $e_1 \sim N(\mu_1, \sigma_1)$, $e_2 \sim N(\mu_2, \sigma_2)$
  – Approximate:
      $\hat{\sigma}_i^2 = \frac{e_i (1 - e_i)}{n_i}$

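33 Comparing Performance of 2 Models
• To test if the performance difference is statistically significant: d = e1 – e2
  – $d \sim N(d_t, \sigma_t)$, where $d_t$ is the true difference
  – Since D1 and D2 are independent, their variances add up:
      $\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \cong \hat{\sigma}_1^2 + \hat{\sigma}_2^2 = \frac{e_1(1 - e_1)}{n_1} + \frac{e_2(1 - e_2)}{n_2}$
  – At the $(1 - \alpha)$ confidence level,
      $d_t = d \pm Z_{\alpha/2}\,\hat{\sigma}_t$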

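34 An Illustrative Example
• Given:
    M1: n1 = 30, e1 = 0.15
    M2: n2 = 5000, e2 = 0.25
• d = |e2 – e1| = 0.1 (2-sided test)
    $\hat{\sigma}_t^2 = \frac{0.15(1 - 0.15)}{30} + \frac{0.25(1 - 0.25)}{5000} = 0.0043$
• At 95% confidence level, $Z_{\alpha/2}$ = 1.96
    $d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$
  => Interval contains 0 => difference may not be statistically significant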
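
The interval for this example can be recomputed as a sketch, assuming the two error rates are independent and approximately normal as stated on the preceding slides; `difference_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def difference_interval(e1, n1, e2, n2, confidence=0.95):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e2 - e1)
    # Variances of independent estimates add up
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(variance)
    return d - half_width, d + half_width

lower, upper = difference_interval(e1=0.15, n1=30, e2=0.25, n2=5000)
print(round(lower, 3), round(upper, 3))
print("significant" if lower > 0 or upper < 0 else "not significant")
```

Almost all of the variance comes from M1, which was tested on only 30 instances.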

35 Comparing Performance of 2 Classifiers
• Each classifier produces k models:
  – C1 may produce M11, M12, …, M1k
  – C2 may produce M21, M22, …, M2k
• If models are applied to the same test sets D1, D2, …, Dk (e.g., via cross-validation)
  – For each set: compute $d_j = e_{1j} - e_{2j}$
  – $d_j$ has mean $d_t$ and variance $\sigma_t^2$
  – Estimate:
      $\hat{\sigma}_t^2 = \frac{\sum_{j=1}^{k} (d_j - \bar{d})^2}{k(k-1)}$
      $d_t = \bar{d} \pm t_{1-\alpha,\,k-1}\,\hat{\sigma}_t$

36 An Illustrative Example
• 30-fold cross validation:
  – Average difference = 0.05
  – Std deviation of difference = 0.002
• At 95% confidence level, t = 2.04
    $d_t = 0.05 \pm 2.04 \times 0.002$
  – Interval does not span the value 0, so the difference is statistically significant
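
A paired comparison of this kind can be sketched as follows. The per-fold differences below are made-up illustrative values, not data from the lecture, and the critical value t = 2.04 is taken from the slide rather than computed; `paired_difference_interval` is a hypothetical helper name.

```python
from math import sqrt

def paired_difference_interval(diffs, t_critical):
    """Interval for the true difference from per-fold differences d_j."""
    k = len(diffs)
    mean = sum(diffs) / k
    # Estimated variance of the mean difference: sum of squares / (k (k - 1))
    variance = sum((d - mean) ** 2 for d in diffs) / (k * (k - 1))
    half_width = t_critical * sqrt(variance)
    return mean - half_width, mean + half_width

# ILLUSTRATIVE differences for 30 folds (assumed values, mean 0.05)
diffs = [0.04, 0.06] * 15
lower, upper = paired_difference_interval(diffs, t_critical=2.04)
print(lower, upper)

# The slide's own numbers: 0.05 +/- 2.04 * 0.002
print(0.05 - 2.04 * 0.002, 0.05 + 2.04 * 0.002)
```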
