
1 Empirical Evaluation (Ch 5)
how accurate is a hypothesis/model/decision tree? given 2 hypotheses, which is better?
accuracy on the training set is a biased estimate
– training error: error_train(h) = #misclassifications / |S_train|
– error_D(h) ≥ error_train(h)
could set aside a random subset of the data for testing
– the sample error on any finite sample S drawn randomly from D is unbiased, but not necessarily the same as the true error: error_S(h) ≠ error_D(h)
what we want is an estimate of the “true” accuracy over the distribution D
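
To make the gap concrete, here is a minimal sketch (not from the slides) comparing training error with held-out error for a decision tree; it assumes numpy and scikit-learn are installed, and the synthetic dataset and tree settings are purely illustrative:

# Training error understates true error; a held-out test set gives an
# unbiased estimate of error_D(h).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

h = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
err_train = 1 - h.score(X_train, y_train)   # error_train(h): optimistic, often near 0
err_test = 1 - h.score(X_test, y_test)      # error_S(h): unbiased estimate of error_D(h)
print(f"training error = {err_train:.3f}, held-out error = {err_test:.3f}")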

2 Confidence Intervals
put a bound on error_D(h) based on the Binomial distribution
– suppose the sample error rate is error_S(h) = p
– then the 95% CI for error_D(h) is p ± 1.96·sqrt(p(1−p)/n)
– the estimate of error_D(h) is error_S(h) = p
– the variance of this estimate is p(1−p)/n
– standard deviation σ = sqrt(var), i.e. var = σ²
– the factor 1.96 comes from the confidence level (95%)
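
As a sketch of the formula above (assuming only the Python standard library; the error rate and sample size are illustrative):

import math

def error_confidence_interval(p, n, z=1.96):
    # CI for error_D(h) given error_S(h) = p on n test examples; z=1.96 gives 95%
    sd = math.sqrt(p * (1 - p) / n)          # std. dev. of the sample error rate
    return p - z * sd, p + z * sd

lo, hi = error_confidence_interval(p=0.10, n=100)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")       # roughly [0.04, 0.16]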

3 Binomial Distribution
put a bound on error_D(h) based on the Binomial distribution
– suppose the true error rate is error_D(h) = p
– on a sample of size n, we would expect np errors on average, but the count can vary around that due to sampling: P(r errors) = C(n,r)·p^r·(1−p)^(n−r)
– the error rate, as a proportion r/n, has mean p and variance p(1−p)/n
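
A small sketch of that sampling variability, assuming scipy is available (n and p are illustrative):

from scipy.stats import binom

n, p = 100, 0.20
print("expected errors:", n * p)                     # np = 20
print("std of error count:", binom.std(n, p))        # sqrt(np(1-p)), about 4
print("P(at most 12 errors):", binom.cdf(12, n, p))  # chance of an unusually low count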

4 Hypothesis Testing
is error_D(h) < 0.2? (is the error rate of h less than 20%?)
– example: is h better than the majority classifier? (suppose error_maj = 20%)
if we approximate the Binomial as a Normal, then μ ± 2σ should bound 95% of the likely range for error_D(h)
two-tailed test:
– the risk that the true error lies outside this range (higher or lower) is 5%
– Pr[Type I error] ≤ 0.05
restrictions: n ≥ 30 or np(1−p) ≥ 5
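
A sketch of the two-tailed check under the normal approximation; the observed error and test-set size are made up for illustration:

import math

err_s, n = 0.15, 200                        # observed sample error on n test examples
sigma = math.sqrt(err_s * (1 - err_s) / n)
lo, hi = err_s - 2 * sigma, err_s + 2 * sigma
print(f"approx. 95% range: [{lo:.3f}, {hi:.3f}]")
print("0.20 inside range, so cannot reject equality:", lo <= 0.20 <= hi)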

5 Gaussian Distribution
μ ± 1.28σ covers 80% of the distribution
z-score: the relative distance of a value x from the mean, z = (x − μ)/σ
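
Where these constants come from, assuming scipy is available:

from scipy.stats import norm

for conf in (0.80, 0.90, 0.95, 0.99):
    z = norm.ppf(0.5 + conf / 2)            # two-tailed critical value
    print(f"{int(conf * 100)}% two-tailed: z = {z:.2f}")
# prints roughly 1.28, 1.64, 1.96, 2.58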

6 for a one-tailed test at significance level α, use the z value a two-tailed test would use at 2α
for example, suppose error_S(h) = 0.19 on a sample of size n, with σ = sqrt(error_S(h)·(1 − error_S(h))/n)
suppose you want 95% confidence that error_D(h) < 20%; then test whether 0.2 − error_S(h) > 1.64·σ
– 1.64 comes from the z-score for a two-tailed confidence of 90%
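
A sketch of that one-tailed check; the sample size is made up for illustration:

import math

err_s, n = 0.19, 5000
sigma = math.sqrt(err_s * (1 - err_s) / n)
# reject "error_D(h) >= 0.20" at 95% confidence if the gap exceeds 1.64 sigma
print("reject 'error >= 0.20':", 0.20 - err_s > 1.64 * sigma)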

7 notice that the confidence interval on the error rate tightens with larger sample sizes
– example: compare 2 trials that both have 10% error
– test set A has 100 examples, h makes 10 errors: error rate = 10/100 = 10%; σ = sqrt(0.1×0.9/100) = 0.03; CI_95(err(h)) = 10% ± 6% = [4%, 16%]
– test set B has 100,000 examples, h makes 10,000 errors: error rate = 10,000/100,000 = 10%; σ = sqrt(0.1×0.9/100,000) = 0.00095; CI_95(err(h)) = 10% ± 0.19% = [9.8%, 10.2%]
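
The same computation for both test sets, reusing the CI idea from slide 2:

import math

def ci95(p, n):
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

print("n=100:     ", ci95(0.10, 100))       # about (0.04, 0.16)
print("n=100,000: ", ci95(0.10, 100_000))   # about (0.098, 0.102)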

8 Comparing 2 Hypotheses (decision trees)
– let d = error_S1(h1) − error_S2(h2) and test whether 0 is in the confidence interval of the difference
– the variances add: σ_d² ≈ p1(1−p1)/n1 + p2(1−p2)/n2
example...
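
A sketch of the comparison; the two error rates and test-set sizes below are made up for illustration:

import math

p1, n1 = 0.12, 500    # error of h1 on its test set
p2, n2 = 0.17, 500    # error of h2 on its test set

d = p2 - p1
sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # variances add
lo, hi = d - 1.96 * sd, d + 1.96 * sd
print(f"difference = {d:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
print("0 outside the CI, so the difference is significant:", not (lo <= 0 <= hi))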

9 Estimating the Accuracy of a Learning Algorithm
error_S(h) is the error rate of a particular hypothesis, which depends on the training data
what we want is an estimate of the error over any training set drawn from the distribution
we could repeat the splitting of the data into independent training/testing sets, build and test k decision trees, and take the average
note that this is a biased estimator, and probably underestimates the true accuracy because each tree is built from fewer examples
– this is a disadvantage of CV: building decision trees with only 90% of the data
– (and it takes 10 times as long)

10 k-fold Cross-Validation (typically k=10)
partition the dataset D into k subsets of equal size (each ≥ 30), T_1..T_k
for i from 1 to k do:
  S_i = D − T_i          // training set, 90% of D
  h_i = L(S_i)           // build decision tree
  e_i = error(h_i, T_i)  // test d-tree on the 10% held out
μ = (1/k) Σ e_i
σ = sqrt( (1/(k−1)) Σ (e_i − μ)² )
SE = σ/√k = sqrt( (1/(k(k−1))) Σ (e_i − μ)² )
CI_95 = μ ± t_{dof,α}·SE   (t_{dof,α} ≈ 2.23 for k=10 and 95% confidence)
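
A sketch of that loop, assuming numpy, scipy, and scikit-learn; the dataset and learner are illustrative stand-ins for L and D:

import numpy as np
from scipy.stats import t
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
k = 10
errors = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    h_i = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - h_i.score(X[test_idx], y[test_idx]))   # e_i on the held-out fold

errors = np.array(errors)
mu = errors.mean()
se = errors.std(ddof=1) / np.sqrt(k)      # sqrt((1/(k(k-1))) * sum (e_i - mu)^2)
t_crit = t.ppf(0.975, df=k - 1)           # two-sided 95% critical value, dof = k-1
print(f"error = {mu:.3f} +/- {t_crit * se:.3f} (95% CI)")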

11 what to do with the 10 accuracies from CV?
– the accuracy of the algorithm is just the mean: (1/k) Σ acc_i
– for a CI, use the “standard error” (SE), the standard deviation of the estimate of the mean:
  σ = sqrt( (1/(k−1)) Σ (e_i − μ)² )
  SE = σ/√k = sqrt( (1/(k(k−1))) Σ (e_i − μ)² )
– 95% CI = μ ± t_{dof,α}·sqrt( (1/(k(k−1))) Σ (e_i − μ)² )
Central Limit Theorem
– we are estimating a “statistic” (a parameter of a distribution, e.g. the mean) from multiple trials
– regardless of the underlying distribution, the estimate of the mean approaches a Normal distribution
– if the std. dev. of the underlying distribution is σ, then the std. dev. of the estimated mean is σ/√n
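
A short sketch of the arithmetic on a made-up set of fold accuracies:

import numpy as np
from scipy.stats import t

acc = np.array([0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81])
k = len(acc)
mean = acc.mean()
se = acc.std(ddof=1) / np.sqrt(k)         # standard error of the mean
half = t.ppf(0.975, df=k - 1) * se
print(f"accuracy = {mean:.3f} +/- {half:.3f} (95% CI)")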

12 example: multiple trials of testing the accuracy of a learner, assuming true acc = 70% and σ = 7%
– there is intrinsic variability in accuracy between different trials
– with more trials, the empirical distribution converges to the underlying one (the std. dev. stays around 7)
– but the estimate of the mean (vertical bars, ±2σ/√n) gets tighter:
  est. of true mean = 71.0 ± 2.5
  est. of true mean = 70.5 ± 0.6
  est. of true mean = 69.98 ± 0.03
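
A simulation sketch of this effect, assuming numpy; the trial counts are chosen so the ±2σ/√n bounds roughly match the ±2.5, ±0.6, ±0.03 figures above:

import numpy as np

rng = np.random.default_rng(0)
for n in (30, 500, 200_000):
    trials = rng.normal(loc=70.0, scale=7.0, size=n)   # accuracies of n independent trials
    print(f"n={n:>7}: trial std = {trials.std():.2f}, "
          f"est. mean = {trials.mean():.2f} +/- {2 * 7.0 / np.sqrt(n):.2f}")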

13 Student’s t distribution is similar to the Normal distribution, but adjusted for small sample sizes; dof = k−1
example: t_{9,0.05} ≈ 2.23 (Table 5.6)
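
For reference, two-sided 95% critical values from scipy; note that 9 degrees of freedom gives about 2.26, while the 2.23 quoted on the slides matches 10 degrees of freedom:

from scipy.stats import t

for dof in (9, 10, 30):
    print(f"dof={dof}: t = {t.ppf(0.975, df=dof):.3f}")   # 2.262, 2.228, 2.042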

14 Comparing 2 Learning Algorithms
– e.g. ID3 with 2 different pruning methods
approach 1:
– run each algorithm 10 times (using CV) independently to get a CI for the accuracy of each algorithm: acc(A), SE(A) and acc(B), SE(B)
– t-test: statistical test of whether the difference of means ≠ 0, d = acc(A) − acc(B)
– problem: the variances are additive (unpooled)

15 suppose the mean acc for A is 61% ± 2 and the mean acc for B is 64% ± 2
d = acc(B) − acc(A): mean = 3%, SE ≈ 3.7 (just a guess)

acc(L_A,T_i)   acc(L_B,T_i)   d = B − A
58             59             +2
62             63             +1
59             63             +4
60             63             +3
62             64             +2
59             62             +3
mean: 60%      63%            +3%, SE = 1%

the mean difference is the same, but B is systematically higher than A

16 approach 2: Paired t-test
– run the algorithms in parallel on the same train/test divisions, giving per-fold differences d_i = acc(L_B,T_i) − acc(L_A,T_i)
– test whether 0 is in the CI of the differences: d_mean ± t_{k−1,α}·sqrt( (1/(k(k−1))) Σ (d_i − d_mean)² )
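
A sketch of the paired test, assuming numpy and scipy; the per-fold accuracies are made up, with B systematically a few points above A as in slide 15:

import numpy as np
from scipy.stats import ttest_rel

acc_A = np.array([0.58, 0.62, 0.59, 0.60, 0.62, 0.59])
acc_B = np.array([0.60, 0.63, 0.63, 0.63, 0.64, 0.62])

diff = acc_B - acc_A
t_stat, p_value = ttest_rel(acc_B, acc_A)   # paired: uses the variance of the differences
print(f"mean difference = {diff.mean():.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject 'no difference' at the 5% level:", p_value < 0.05)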

