Empirical Evaluation (Ch 5)
– how accurate is a hypothesis/model/decision tree? given 2 hypotheses, which is better?
– accuracy on the training set is biased:
  error_train(h) = #misclassifications / |S_train|
  error_D(h) ≥ error_train(h)
– could set aside a random subset of the data for testing
  the sample error for any finite sample S drawn randomly from D is unbiased, but not necessarily the same as the true error: error_S(h) ≠ error_D(h)
– what we want is an estimate of the "true" accuracy over the distribution D
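(a minimal Python sketch of the train/test split idea above; the dataset and the scikit-learn decision tree are illustrative assumptions, not part of the lecture:)

# training error is optimistic; a held-out test set gives an unbiased estimate of error_D(h)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

h = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
err_train = 1 - h.score(X_train, y_train)   # error_train(h): typically ~0 for a fully grown tree
err_test = 1 - h.score(X_test, y_test)      # error_S(h): unbiased estimate of error_D(h)
print(f"error_train(h) = {err_train:.3f}, error_S(h) = {err_test:.3f}")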

Confidence Intervals
– put a bound on error_D(h) based on the Binomial distribution
– suppose the sample error rate is error_S(h) = p
– then the 95% CI for error_D(h) is p ± 1.96·√(p(1−p)/n)
  E[error_S(h)] = error_D(h), so the point estimate is p
  var(error_S(h)) ≈ p(1−p)/n
  standard deviation σ = √var; var = σ²
  the factor 1.96 comes from the confidence level (95%)
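(a small Python sketch of this CI formula; the error counts below are made-up numbers:)

import math

def error_ci(p, n, z=1.96):
    # 95% CI for error_D(h) given sample error rate p on n test examples
    se = math.sqrt(p * (1 - p) / n)      # sqrt(p(1-p)/n)
    return p - z * se, p + z * se

# e.g. 30 mistakes on 200 test examples
print(error_ci(30 / 200, 200))           # roughly (0.10, 0.20)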

Binomial Distribution
– put a bound on error_D(h) based on the Binomial distribution
– suppose the true error rate is error_D(h) = p
– on a sample of size n, we would expect np errors on average, but the observed count r varies around that due to sampling: P(r) = C(n,r)·p^r·(1−p)^(n−r), with variance np(1−p)
– the error rate as a proportion, r/n, has mean p and variance p(1−p)/n
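(a quick simulation sketch of these Binomial facts; the true error rate and test-set size are assumed for illustration:)

import numpy as np

p, n = 0.2, 100                                 # assumed true error rate and test-set size
rng = np.random.default_rng(0)
errors = rng.binomial(n, p, size=10_000)        # number of errors in each simulated test set

print(errors.mean(), p * n)                     # ~20 = np
print(errors.var(), n * p * (1 - p))            # ~16 = np(1-p)
print((errors / n).std(), np.sqrt(p * (1 - p) / n))   # as a proportion: sd ~ sqrt(p(1-p)/n) = 0.04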

Hypothesis Testing
– is error_D(h) < 0.2? (is the error rate of h less than 20%?)
  example: is h better than the majority classifier? (suppose error_maj = 20%)
– if we approximate the Binomial as Normal, then μ ± 2σ should bound 95% of the likely range for error_D(h)
– two-tailed test:
  risk of the true error being higher or lower is 5%
  Pr[Type I error] ≤ 0.05
– restrictions: n ≥ 30 or np(1−p) ≥ 5
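(a sketch of the two-tailed check under the Normal approximation; the observed error rate and test-set size are made up:)

import math

def two_tailed_reject(p_hat, n, p0=0.2, z=1.96):
    # reject H0: error_D(h) = p0 if p0 falls outside p_hat +/- z*sigma (z = 1.96, ~2)
    assert n >= 30 or n * p_hat * (1 - p_hat) >= 5   # normal-approximation restriction
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)
    return abs(p_hat - p0) > z * sigma

print(two_tailed_reject(0.14, 250))   # True: 0.14 on 250 examples is far enough from 0.20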

Gaussian Distribution
– μ ± 1.28σ covers 80% of the distribution
– z-score: the relative distance of a value x from the mean, z = (x − μ)/σ
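(a two-line check of these facts with scipy; the numbers plugged into the z-score are arbitrary:)

from scipy.stats import norm

z = lambda x, mu, sigma: (x - mu) / sigma     # z-score of x
print(z(0.25, 0.20, 0.04))                    # ~1.25 standard deviations above the mean
print(norm.cdf(1.28) - norm.cdf(-1.28))       # ~0.80: mu +/- 1.28*sigma covers ~80%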

– for a one-tailed test, use the z value of a two-sided interval with twice the significance: a two-sided test at α gives a one-sided test at α/2
– for example, suppose error_S(h) = 0.19 and you want 95% confidence that error_D(h) < 20%; then test whether 0.2 − error_S(h) > 1.64·σ
– 1.64 comes from the z-score for a two-sided confidence of 90% (α = 0.10), which gives 95% one-sided confidence
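(a sketch of this one-tailed test; the sample sizes are made up to show when the conclusion does and does not hold:)

import math

def better_than(p_hat, n, p0=0.2, z=1.645):
    # one-tailed test: conclude error_D(h) < p0 with ~95% confidence
    # if p0 - p_hat > z*sigma (z = 1.645 puts all 5% in one tail)
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p0 - p_hat) > z * sigma

print(better_than(0.19, 100))      # False: 19% error on 100 examples is not convincingly below 20%
print(better_than(0.19, 10_000))   # True: the same rate on 10,000 examples is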

notice that the confidence interval on the error rate tightens with larger sample sizes
– example: compare 2 trials that both have 10% error
  test set A has 100 examples and h makes 10 errors: p = 10/100 = 0.1, σ = √(0.1·0.9/100) = 0.03, CI_95%(err(h)) = [10 ± 6%] = [4–16%]
  test set B has 100,000 examples and 10,000 errors: p = 10,000/100,000 = 0.1, σ = √(0.1·0.9/100,000) ≈ 0.00095, CI_95%(err(h)) = [10 ± 0.19%] = [9.81–10.19%]
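(a sketch that reproduces both of these interval widths:)

import math

for n in (100, 100_000):
    p = 0.10                                   # observed error rate in both trials
    half = 1.96 * math.sqrt(p * (1 - p) / n)   # half-width of the 95% CI
    print(f"n={n}: 10% +/- {100 * half:.2f}%")
# n=100:     10% +/- 5.88%  (roughly the +/-6% above)
# n=100000:  10% +/- 0.19%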

Comparing 2 Hypotheses (decision trees)
– test whether 0 is in the confidence interval of the difference d = error_S1(h1) − error_S2(h2)
– add the variances: σ_d² ≈ error_S1(h1)(1−error_S1(h1))/n1 + error_S2(h2)(1−error_S2(h2))/n2
– example: see the sketch below
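(a hypothetical sketch of this difference-of-errors CI; the two error rates and test-set sizes are invented:)

import math

def diff_ci95(p1, n1, p2, n2, z=1.96):
    # CI for error_D(h1) - error_D(h2); the variances of the two estimates add
    d = p1 - p2
    sigma_d = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d - z * sigma_d, d + z * sigma_d

lo, hi = diff_ci95(0.25, 200, 0.18, 300)
print((lo, hi), "significant" if not (lo <= 0 <= hi) else "0 is in the CI: not significant")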

Estimating the Accuracy of a Learning Algorithm
– error_S(h) is the error rate of a particular hypothesis, which depends on the training data
– what we want is an estimate of the error over any training set drawn from the distribution
– we could repeat the splitting of the data into independent training/testing sets, build and test k decision trees, and take the average
– note that this is a biased estimator that probably under-estimates the true accuracy, because each tree is built from fewer examples
  this is a disadvantage of CV: building decision trees with only 90% of the data
  (and it takes 10 times as long)

k-fold Cross-Validation (typically k=10)
partition the dataset D into k subsets of equal size (each ≥ 30 examples), T_1..T_k
for i from 1 to k do:
    S_i = D − T_i          // training set, 90% of D
    h_i = L(S_i)           // build decision tree
    e_i = error(h_i, T_i)  // test d-tree on the 10% held out
μ = (1/k) Σ e_i
σ = √( (1/k) Σ (e_i − μ)² )
SE = √( (1/(k(k−1))) Σ (e_i − μ)² )   (≈ σ/√k)
CI_95 = μ ± t_{dof,α} · SE   (t_{dof,α} ≈ 2.23 for k=10 and α=95%)
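(a runnable Python version of this loop; the dataset and the scikit-learn decision tree stand in for D and L, and are assumptions rather than part of the slides:)

import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
k = 10
errs = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    h = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])  # h_i = L(S_i)
    errs.append(1 - h.score(X[test_idx], y[test_idx]))                          # e_i = error(h_i, T_i)

errs = np.array(errs)
mu = errs.mean()                            # (1/k) * sum(e_i)
se = errs.std(ddof=1) / np.sqrt(k)          # standard error of the mean
t_crit = stats.t.ppf(0.975, k - 1)          # two-sided 95% t value (~2.26 for dof = 9)
print(f"error = {mu:.3f} +/- {t_crit * se:.3f} (95% CI)")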

what to do with the 10 accuracies from CV?
– the accuracy of the algorithm is just the mean: (1/k) Σ acc_i
– for the CI, use the "standard error" (SE): σ = √( (1/k) Σ (e_i − μ)² ), SE = √( (1/(k(k−1))) Σ (e_i − μ)² )
  this is the standard deviation of the estimate of the mean
  95% CI = μ ± t_{dof,α} · √( (1/(k(k−1))) Σ (e_i − μ)² )
Central Limit Theorem
– we are estimating a "statistic" (a parameter of a distribution, e.g. the mean) from multiple trials
– regardless of the underlying distribution, the estimate of the mean approaches a Normal distribution
– if the std. dev. of the underlying distribution is σ, then the std. dev. of the estimate of the mean is σ/√n
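(the same arithmetic applied to a hand-entered list of fold accuracies; the ten numbers are made up:)

import math
from scipy import stats

accs = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94, 0.90, 0.91]   # 10 CV fold accuracies (invented)
k = len(accs)
mean = sum(accs) / k
se = math.sqrt(sum((a - mean) ** 2 for a in accs) / (k * (k - 1)))     # SE = sqrt( 1/(k(k-1)) * sum((a_i - mean)^2) )
t_crit = stats.t.ppf(0.975, k - 1)                                     # ~2.26 for dof = 9
print(f"accuracy = {mean:.3f} +/- {t_crit * se:.3f} (95% CI)")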

example: multiple trials of testing the accuracy of a learner, assuming true acc = 70% and σ = 7%
– there is intrinsic variability in accuracy between different trials
– with more trials, the distribution converges to the underlying one (std. dev. stays around 7)
– but the estimate of the mean (vertical bars, ±2σ/√n) gets tighter:
  est. of true mean = 71.0 ± 2.5
  est. of true mean = 70.5 ± 0.6
  est. of true mean = … ± 0.03
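(a small simulation sketch of this effect, assuming the same true accuracy of 70% and σ = 7%; the trial counts are arbitrary:)

import numpy as np

rng = np.random.default_rng(0)
true_acc, sigma = 70.0, 7.0               # assumed true accuracy and spread, in %
for n in (30, 300, 30_000):               # number of trials (illustrative sizes)
    trials = rng.normal(true_acc, sigma, size=n)
    print(f"n={n}: sd of trials = {trials.std():.1f} (stays ~7), "
          f"est. of mean = {trials.mean():.2f} +/- {2 * sigma / np.sqrt(n):.2f}")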

Student’s t Distribution
– similar to the Normal distribution, but adjusted for small sample size; dof = k − 1
– example: t_{9, 0.05} ≈ 2.23 (Table 5.6)

Comparing 2 Learning Algorithms
– e.g. ID3 with 2 different pruning methods
approach 1:
– run each algorithm 10 times (using CV) independently to get a CI for the accuracy of each algorithm: acc(A), SE(A) and acc(B), SE(B)
– t-test: statistical test of whether the difference in means ≠ 0, d = acc(A) − acc(B)
– problem: the variances are additive (unpooled), SE(d) = √(SE(A)² + SE(B)²), so the CI for d is wide (see the sketch below)
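(a sketch of approach 1 using the numbers from the example on the next slide; the SE for B is a made-up stand-in since it is not given:)

import math

acc_A, se_A = 0.61, 0.02     # 61% +/- 2 (from the example)
acc_B, se_B = 0.64, 0.031    # 64%; an SE of 3.1 is assumed here

d = acc_B - acc_A
se_d = math.sqrt(se_A ** 2 + se_B ** 2)        # unpooled: the variances add
print(f"d = {d:.2f} +/- {1.96 * se_d:.2f}")    # ~0.03 +/- 0.07: CI contains 0, so no significant difference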

suppose the mean acc for A is 61% ± 2 and the mean acc for B is 64% ± …
– d = acc(B) − acc(A): mean = 3%, SE ≈ 3.7 (just a guess)
– now tabulate the accuracies fold by fold on the same test sets:
  acc(L_A, T_i) | acc(L_B, T_i) | d = B − A
  …             | …             | mean +3%, SE = 1%
– the mean difference is the same, but B is systematically higher than A

approach 2: Paired T-test
– run the algorithms in parallel on the same divisions of the data (same train/test folds)
– test whether 0 is in the CI of the per-fold differences: δ_i = acc(L_B, T_i) − acc(L_A, T_i), δ̄ = (1/k) Σ δ_i, CI_95 = δ̄ ± t_{k−1,α} · √( (1/(k(k−1))) Σ (δ_i − δ̄)² )
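(a paired t-test sketch; the per-fold accuracies are invented, but B is systematically about 3% higher than A as in the example above:)

import numpy as np
from scipy import stats

# both learners evaluated on the same k test folds (made-up accuracies)
acc_A = np.array([0.60, 0.58, 0.63, 0.61, 0.59, 0.62, 0.60, 0.64, 0.61, 0.62])
acc_B = np.array([0.63, 0.61, 0.66, 0.64, 0.61, 0.65, 0.63, 0.67, 0.64, 0.65])

d = acc_B - acc_A                        # per-fold differences delta_i
k = len(d)
mean_d = d.mean()
se_d = d.std(ddof=1) / np.sqrt(k)        # sqrt( 1/(k(k-1)) * sum((d_i - mean_d)^2) )
t_crit = stats.t.ppf(0.975, k - 1)
lo, hi = mean_d - t_crit * se_d, mean_d + t_crit * se_d
print(f"mean diff = {mean_d:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")   # 0 is outside the CI: B really is better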