1 Machine Learning: Lecture 5 Experimental Evaluation of Learning Algorithms (Based on Chapter 5 of Mitchell T., Machine Learning, 1997)

2 Motivation
Evaluating the performance of learning systems is important because:
- Learning systems are usually designed to predict the class of “future” unlabeled data points.
- In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).

3 Difficulties in Evaluating Hypotheses When Only Limited Data Are Available
- Bias in the estimate: the observed accuracy of the learned hypothesis over the training examples is a poor estimator of its accuracy over future examples ==> we test the hypothesis on a test set chosen independently of the training set and the hypothesis.
- Variance in the estimate: even with a separate test set, the measured accuracy can vary from the true accuracy, depending on the makeup of the particular set of test examples. The smaller the test set, the greater the expected variance.

4 Questions Considered
- Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?
- Given that one hypothesis outperforms another over some sample data, how probable is it that this hypothesis is more accurate in general?
- When data is limited, what is the best way to use this data both to learn a hypothesis and to estimate its accuracy?

5 Estimating Hypothesis Accuracy
Two questions of interest:
- Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution? ==> sample error vs. true error
- What is the probable error in this accuracy estimate? ==> confidence intervals

6 Sample Error and True Error
- Definition 1: The sample error (denoted $\text{error}_S(h)$) of hypothesis h with respect to target function f and data sample S is:
  $\text{error}_S(h) = \frac{1}{n} \sum_{x \in S} \delta(f(x), h(x))$
  where n is the number of examples in S, and the quantity $\delta(f(x), h(x))$ is 1 if $f(x) \neq h(x)$, and 0 otherwise.
- Definition 2: The true error (denoted $\text{error}_D(h)$) of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:
  $\text{error}_D(h) = \Pr_{x \in D}[f(x) \neq h(x)]$
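A minimal sketch of the sample-error definition in Python; the function name and the toy threshold hypothesis below are illustrative, not part of the lecture:

```python
def sample_error(h, S):
    """Fraction of examples in S that hypothesis h misclassifies.

    S is a list of (x, f_x) pairs, where f_x is the target label f(x);
    h is any callable mapping an instance x to a predicted label.
    """
    mistakes = sum(1 for x, f_x in S if h(x) != f_x)
    return mistakes / len(S)

# Toy usage: a threshold hypothesis on one-dimensional inputs.
h = lambda x: 1 if x >= 0.5 else 0
S = [(0.2, 0), (0.7, 1), (0.4, 1), (0.9, 1)]   # (x, f(x)) pairs
print(sample_error(h, S))                      # 0.25: one mistake out of four
```

The true error, by contrast, cannot be computed from S alone: it depends on the distribution D, which is exactly why the following slides quantify how far $\text{error}_S(h)$ may stray from $\text{error}_D(h)$.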

7 Confidence Intervals for Discrete-Valued Hypotheses
- The general expression for approximate N% confidence intervals for $\text{error}_D(h)$ is:
  $\text{error}_S(h) \pm z_N \sqrt{\frac{\text{error}_S(h)\,(1 - \text{error}_S(h))}{n}}$
  where $z_N$ is given in [Mitchell, Table 5.1].
- This approximation is quite good when $n \cdot \text{error}_S(h)\,(1 - \text{error}_S(h)) \geq 5$.
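A hedged sketch of this computation, with the two-sided $z_N$ values taken from Mitchell's Table 5.1; the function name and the example counts are illustrative:

```python
import math

# Two-sided z_N values for common confidence levels (Mitchell, Table 5.1).
Z = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def error_confidence_interval(r, n, level=95):
    """Approximate N% confidence interval for error_D(h),
    given r misclassified examples out of n test examples."""
    e = r / n                                   # error_S(h)
    half_width = Z[level] * math.sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width

# Example: h misclassifies 12 of 40 test examples.
print(error_confidence_interval(12, 40, level=95))   # roughly (0.16, 0.44)
```

For these toy numbers, $n \cdot \text{error}_S(h)(1 - \text{error}_S(h)) = 40 \cdot 0.3 \cdot 0.7 = 8.4 \geq 5$, so the normal approximation on the slide is reasonable.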

8 Mean and Variance
- Definition 1: Consider a random variable Y that takes on possible values $y_1, \ldots, y_n$. The expected value (or mean value) of Y, E[Y], is:
  $E[Y] = \sum_{i=1}^{n} y_i \Pr(Y = y_i)$
- Definition 2: The variance of a random variable Y, Var[Y], is:
  $Var[Y] = E[(Y - E[Y])^2]$
- Definition 3: The standard deviation of a random variable Y is the square root of the variance.
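A quick worked instance of these definitions, for a fair six-sided die (an illustrative example, not from the lecture):
$E[Y] = \sum_{i=1}^{6} i \cdot \tfrac{1}{6} = 3.5$, $Var[Y] = E[(Y - 3.5)^2] = \tfrac{35}{12} \approx 2.92$, and the standard deviation is $\sqrt{35/12} \approx 1.71$.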

9 Estimators, Bias and Variance
- Since $\text{error}_S(h)$ (an estimator for the true error) obeys a Binomial distribution (see [Mitchell, Section 5.3]), we have:
  $\text{error}_S(h) = r/n$ and $\text{error}_D(h) = p$
  where n is the number of instances in the sample S, r is the number of instances from S misclassified by h, and p is the probability of misclassifying a single instance drawn from D.
- Definition: The estimation bias (not to be confused with the inductive bias) of an estimator Y for an arbitrary parameter p is $E[Y] - p$.
- The standard deviation of $\text{error}_S(h)$ is given by:
  $\sigma_{\text{error}_S(h)} = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{\text{error}_S(h)\,(1 - \text{error}_S(h))}{n}}$
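A small simulation sketch (the numbers are purely illustrative) that draws many size-n test samples from a Bernoulli error process with true error p and checks that $\text{error}_S(h) = r/n$ has mean close to p (zero estimation bias) and standard deviation close to $\sqrt{p(1-p)/n}$:

```python
import math
import random
import statistics

p, n, trials = 0.30, 40, 20000      # true error, test-set size, number of simulated samples
random.seed(0)

# Each trial: n Bernoulli(p) "misclassified?" indicators, then error_S(h) = r/n.
estimates = [sum(random.random() < p for _ in range(n)) / n for _ in range(trials)]

print(statistics.mean(estimates))   # ~0.30 -> estimation bias E[error_S(h)] - p is ~0
print(statistics.stdev(estimates))  # ~0.072
print(math.sqrt(p * (1 - p) / n))   # 0.0725, the theoretical standard deviation
```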

10 Difference in Error of Two Hypotheses
- Let $h_1$ and $h_2$ be two hypotheses for some discrete-valued target function. $h_1$ has been tested on a sample $S_1$ containing $n_1$ randomly drawn examples, and $h_2$ has been tested on an independent sample $S_2$ containing $n_2$ examples drawn from the same distribution.
- Let's estimate the difference between the true errors of these two hypotheses, d, by computing the difference between the sample errors:
  $\hat{d} = \text{error}_{S_1}(h_1) - \text{error}_{S_2}(h_2)$
- The approximate N% confidence interval for d is:
  $\hat{d} \pm z_N \sqrt{\frac{\text{error}_{S_1}(h_1)\,(1 - \text{error}_{S_1}(h_1))}{n_1} + \frac{\text{error}_{S_2}(h_2)\,(1 - \text{error}_{S_2}(h_2))}{n_2}}$
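A sketch of the corresponding interval computation; the function name and the counts are illustrative, and z = 1.96 corresponds to N = 95%:

```python
import math

def difference_confidence_interval(r1, n1, r2, n2, z=1.96):
    """Approximate confidence interval for the difference in true errors d,
    given r1/n1 errors of h1 on S1 and r2/n2 errors of h2 on S2."""
    e1, e2 = r1 / n1, r2 / n2
    d_hat = e1 - e2
    half_width = z * math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - half_width, d_hat + half_width

# Example: h1 misclassifies 30 of 100 examples, h2 misclassifies 20 of 100.
print(difference_confidence_interval(30, 100, 20, 100))
# roughly (-0.02, 0.22): the interval contains 0, so at the 95% level the
# observed difference of 0.10 is not statistically significant.
```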

11 Comparing Learning Algorithms
- Which of $L_A$ and $L_B$ is the better learning method on average for learning some particular target function f?
- To answer this question, we wish to estimate the expected value of the difference in their errors:
  $E_{S \subset D}[\text{error}_D(L_A(S)) - \text{error}_D(L_B(S))]$
- Of course, since we have only a limited sample $D_0$, we estimate this quantity by dividing $D_0$ into a training set $S_0$ and a test set $T_0$ and measuring:
  $\text{error}_{T_0}(L_A(S_0)) - \text{error}_{T_0}(L_B(S_0))$
- Problem: we are only measuring the difference in errors for one training set $S_0$ rather than the expected value of this difference over all samples S drawn from D.
  Solution: k-fold cross-validation.

12 k-Fold Cross-Validation
1. Partition the available data $D_0$ into k disjoint subsets $T_1, T_2, \ldots, T_k$ of equal size, where this size is at least 30.
2. For i from 1 to k, use $T_i$ for the test set and the remaining data for the training set $S_i$:
   $S_i \leftarrow \{D_0 - T_i\}$
   $h_A \leftarrow L_A(S_i)$
   $h_B \leftarrow L_B(S_i)$
   $\delta_i \leftarrow \text{error}_{T_i}(h_A) - \text{error}_{T_i}(h_B)$
3. Return the value $\text{avg}(\delta)$, where
   $\text{avg}(\delta) = \frac{1}{k} \sum_{i=1}^{k} \delta_i$
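A runnable sketch of this paired procedure; scikit-learn is used here only as a convenient stand-in for the learners $L_A$, $L_B$ and the data set $D_0$, and none of these particular choices come from the lecture:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)        # stands in for D_0
k = 10
deltas = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # S_i <- D_0 - T_i: train both learners on exactly the same training split.
    h_A = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    h_B = GaussianNB().fit(X[train_idx], y[train_idx])
    # delta_i <- error_Ti(h_A) - error_Ti(h_B); score() is accuracy, so error = 1 - score.
    err_A = 1 - h_A.score(X[test_idx], y[test_idx])
    err_B = 1 - h_B.score(X[test_idx], y[test_idx])
    deltas.append(err_A - err_B)

avg_delta = sum(deltas) / k
print(avg_delta)
```

The essential point of the pairing is that both learners see the same training split and are scored on the same held-out fold, so each $\delta_i$ reflects a like-for-like comparison.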

13 Confidence of the k-Fold Estimate
- The approximate N% confidence interval for estimating $E_{S \subset D_0}[\text{error}_D(L_A(S)) - \text{error}_D(L_B(S))]$ using $\text{avg}(\delta)$ is given by:
  $\text{avg}(\delta) \pm t_{N,k-1} \, s_{\text{avg}(\delta)}$
  where $t_{N,k-1}$ is a constant similar to $z_N$ (see [Mitchell, Table 5.6]) and $s_{\text{avg}(\delta)}$ is an estimate of the standard deviation of the distribution governing $\text{avg}(\delta)$:
  $s_{\text{avg}(\delta)} = \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} \left(\delta_i - \text{avg}(\delta)\right)^2}$
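A sketch of this t-interval, applied to a list `deltas` of per-fold differences such as the one produced by the k-fold loop above; scipy is used here only to look up the t-value (Mitchell's Table 5.6 could be consulted instead), and the sample numbers are made up:

```python
import math

from scipy import stats

def kfold_confidence_interval(deltas, level=0.95):
    """Approximate confidence interval for the expected difference in errors,
    computed from the k per-fold differences delta_1, ..., delta_k."""
    k = len(deltas)
    avg = sum(deltas) / k
    s_avg = math.sqrt(sum((d - avg) ** 2 for d in deltas) / (k * (k - 1)))
    t = stats.t.ppf(1 - (1 - level) / 2, df=k - 1)   # two-sided t_{N, k-1}
    return avg - t * s_avg, avg + t * s_avg

# Example with ten made-up per-fold differences:
deltas = [0.02, -0.01, 0.03, 0.00, 0.01, 0.02, -0.02, 0.01, 0.00, 0.02]
print(kfold_confidence_interval(deltas, level=0.95))
```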