
1 Machine Learning, Chen Yu (陈昱), Institute of Computer Science and Technology, Peking University, Information Security Engineering Research Center

2 Course Information  Instructor: Chen Yu (陈昱), chen_yu@pku.edu.cn, Tel: 82529680  TA: Cheng Zaixing (程再兴), Tel: 62763742, wataloo@hotmail.com  Course page: http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht

3 Ch5 Evaluating Hypotheses 1. Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples? (hypothesis accuracy) 2. Given that hypothesis h outperforms h' over some sample of data, how probable is it that h outperforms h' in general? (difference between hypotheses) 3. When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

4 Agenda  Estimating hypothesis accuracy  Basics of sampling theory  Deriving confidence intervals (general approach)  Difference between hypotheses  Comparing learning algorithms

5 Learning Problem Setting  Space of possible instances X (e.g. the set of all people) over which target functions may be defined.  Assume that different instances in X may be encountered with different frequencies.  Model this assumption as an unknown probability distribution D that defines the probability of encountering each instance in X.  Training examples are provided by drawing instances independently from X, according to D.
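The sampling assumption above can be sketched as a minimal simulation: instances are drawn i.i.d. from X according to a fixed distribution D that the learner never sees directly. The instance space and weights below are illustrative assumptions, not part of the slides.

```python
import random

# Hypothetical instance space X and distribution D; in the learning
# setting D is unknown to the learner, here we fix it for simulation.
X = ["alice", "bob", "carol", "dave"]
D_weights = [0.4, 0.3, 0.2, 0.1]

def draw_training_sample(n, seed=0):
    """Draw n instances independently from X according to D."""
    rng = random.Random(seed)
    return rng.choices(X, weights=D_weights, k=n)

sample = draw_training_sample(100)
```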

6 Bias & Variance  With limited data, two difficulties arise when we try to estimate the accuracy of a learned hypothesis: Bias: the training examples typically provide an optimistically biased estimate of the accuracy of the learned hypothesis over future examples (the overfitting problem). Variance: even if the accuracy is measured over an unbiased set of test examples, the makeup of the test set can still affect the measured accuracy of the learned hypothesis.

7 Agenda  Estimating hypothesis accuracy  Basics of sampling theory  Deriving confidence intervals (general approach)  Difference between hypotheses  Comparing learning algorithms

8 Qs in Focus 1. Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from D? 2. What is the probable error in this estimate?

9 Sample Error & True Error  The sample error of hypothesis h w.r.t. target function f and data set S of n samples is error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x)), where δ(·) is 1 if its argument is true and 0 otherwise.  The true error of hypothesis h w.r.t. target function f and distribution D is error_D(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)].  So the two Qs become: how well does error_S(h) estimate error_D(h)?
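The sample-error definition can be made concrete with a small Python sketch; the toy target function, hypothesis, and sample below are assumptions for illustration only.

```python
def sample_error(h, f, S):
    """error_S(h): the fraction of instances in S on which h disagrees
    with the target function f."""
    return sum(1 for x in S if h(x) != f(x)) / len(S)

# Toy example (assumed for illustration): target "x is even",
# hypothesis "x is divisible by 4".
f = lambda x: x % 2 == 0
h = lambda x: x % 4 == 0
S = list(range(8))           # h errs on x = 2 and x = 6
err = sample_error(h, f, S)  # 2/8 = 0.25
```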

10 Confidence Interval for Discrete-Valued Hypotheses  Assume sample S contains n examples drawn independently of one another, and independently of h, according to distribution D, with n ≥ 30  Then, given no other information, the most probable value of error_D(h) is error_S(h); furthermore, with approximately 95% probability, error_D(h) lies in the interval error_S(h) ± 1.96 sqrt[error_S(h)(1-error_S(h))/n]
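The 95% interval can be computed directly; a minimal sketch of the formula above with z_95 ≈ 1.96 (the example values r = 12, n = 40 are assumptions for illustration):

```python
import math

def ci95(error_s, n):
    """Approximate 95% confidence interval for error_D(h); the normal
    approximation behind it is reasonable for n >= 30."""
    half = 1.96 * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half, error_s + half

# e.g. h makes r = 12 errors on n = 40 test examples: error_S(h) = 0.30
lo, hi = ci95(12 / 40, 40)   # ≈ (0.158, 0.442)
```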

11 Agenda  Estimating hypothesis accuracy  Basics of sampling theory  Deriving confidence intervals (general approach)  Difference between hypotheses  Comparing learning algorithms

12 Binomial Probability Distribution  The probability P(r) of r heads in n coin flips, given Pr(head in one flip) = p, is P(r) = n!/(r!(n-r)!) p^r (1-p)^(n-r)  The expected value of a binomial random variable X = b(n,p) is E[X] = np  The variance of X is Var(X) = np(1-p)  The standard deviation of X is σ_X = sqrt(np(1-p))
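A quick Python check of these identities (a sketch; the values n = 40, p = 0.3 are arbitrary choices):

```python
import math

def binom_pmf(r, n, p):
    """P(r): probability of exactly r heads in n flips with Pr(head) = p."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 40, 0.3
mean = n * p                       # E[X] = np = 12.0
var = n * p * (1 - p)              # Var(X) = np(1-p) = 8.4
total = sum(binom_pmf(r, n, p) for r in range(n + 1))  # pmf sums to 1
```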

13 Example  (Figure omitted: plot of a binomial distribution.)  Remark: bell-shaped curve

14 Compute error_S(h)  Assume h misclassifies r samples from a set S of n samples drawn independently according to D; then r is governed by the binomial distribution b(n, error_D(h)), and error_S(h) = r/n

15 Normal Distribution  80% of the area of the probability density function N(μ,σ) lies in μ ± 1.28σ  N% of the area of the probability density function N(μ,σ) lies in μ ± z_N σ
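The z_N constants can be computed from the inverse CDF of the standard normal; a sketch using Python's standard-library statistics.NormalDist:

```python
from statistics import NormalDist

def z_value(confidence):
    """Two-sided z_N: N% of the mass of N(0,1) lies within ±z_N."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

z80 = z_value(0.80)   # ≈ 1.28
z95 = z_value(0.95)   # ≈ 1.96
```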

16 Approximation of error_S(h)  When n is large enough, error_S(h) can be approximated by a normal distribution with the same expected value and variance, i.e. N(error_D(h), error_D(h)(1-error_D(h))/n) (a corollary of the Central Limit Theorem). The rule of thumb is n ≥ 30, or n·error_D(h)(1-error_D(h)) ≥ 5

17 Confidence Interval for Estimating error_D(h)  It follows that with approximately N% probability, error_S(h) lies in the interval error_D(h) ± z_N sqrt[error_D(h)(1-error_D(h))/n]  Equivalently, error_D(h) lies in the interval error_S(h) ± z_N sqrt[error_D(h)(1-error_D(h))/n], which can be approximated (by Bernoulli's law of large numbers, 贝努里大数定律) by error_S(h) ± z_N sqrt[error_S(h)(1-error_S(h))/n]  Thus we have derived the confidence interval for discrete-valued hypotheses

18 Two-Sided & One-Sided Bounds  Sometimes it is desirable to convert a two-sided bound into a one-sided bound, for example when we are interested in the question "What is the probability that error_D(h) is at most U (a certain upper bound)?"  Convert a two-sided bound into a one-sided bound using the symmetry of the normal distribution (Fig. 5.1 in the textbook)

19 Qs in Focus 1. Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from D? A: Prefer an unbiased estimator with minimum variance 2. What is the probable error in this estimate? A: Derive a confidence interval

20 Agenda  Estimating hypothesis accuracy  Basics of sampling theory  Deriving confidence intervals (general approach)  Difference between hypotheses  Comparing learning algorithms

21 General Approach 1. Pick the parameter p to be estimated, e.g. error_D(h) 2. Choose an estimator, desirably unbiased and with minimum variance, e.g. error_S(h) with large n 3. Determine the probability distribution that governs the estimator 4. Find an interval (L, U) such that N% of the probability mass falls in the interval

22 Central Limit Theorem  Consider a set of independent, identically distributed (i.i.d.) random variables Y_1 … Y_n, all governed by an arbitrary probability distribution D with mean μ and finite variance σ². Define the sample mean Ȳ_n = (1/n) Σ_{i=1..n} Y_i  Central Limit Theorem: as n → ∞, the distribution governing Ȳ_n approaches N(μ, σ²/n)
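A seeded simulation illustrates the theorem: sample means of a decidedly non-normal (uniform) distribution cluster around μ = 0.5 with standard deviation ≈ σ/sqrt(n) = sqrt(1/12)/sqrt(50) ≈ 0.041. The sample sizes here are arbitrary choices for the sketch.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 50           # size of each sample
trials = 2000    # number of sample means to collect

# Each Y_i is Uniform(0,1): mu = 0.5, sigma^2 = 1/12.
sample_means = [mean(rng.random() for _ in range(n)) for _ in range(trials)]

grand_mean = mean(sample_means)                    # ≈ 0.5
spread = (sum((m - grand_mean) ** 2 for m in sample_means)
          / trials) ** 0.5                         # ≈ sigma / sqrt(n)
```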

23 Approximating error_S(h) by a Normal Distribution  In the Central Limit Theorem, take the distribution to be a Bernoulli experiment with p = error_D(h), and we are done!

24 Agenda  Estimating hypothesis accuracy  Basics of sampling theory  Deriving confidence intervals (general approach)  Difference between hypotheses  Comparing learning algorithms

25 Ch5 Evaluating Hypotheses 1. Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples? (hypothesis accuracy, done!) 2. Given that hypothesis h outperforms h' over some sample, how probable is it that h outperforms h' in general? (difference between hypotheses, this section) 3. When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

26 Difference in Error Test h_1 on sample S_1, test h_2 on S_2 1. Pick the parameter to be estimated: d ≡ error_D(h_1) - error_D(h_2) 2. Choose an estimator: d̂ ≡ error_S1(h_1) - error_S2(h_2) Properties of d̂:  Unbiased estimator  When n is large enough, e.g. n ≥ 30, it can be approximated by the difference of two normal distributions, itself a normal distribution, with mean d and, when the two tests are independent, var = var(error_S1(h_1)) + var(error_S2(h_2)) 3. ……

27 Difference in Error (2)  Remark: when S_1 = S_2, the variance of the estimator usually becomes smaller (elimination of the difference in composition of the two sample sets)

28 Hypothesis Testing  Consider instead the question "What is the probability that error_D(h_1) ≥ error_D(h_2)?"  E.g. S_1, S_2 of size 100, error_S1(h_1) = 0.3, error_S2(h_2) = 0.2, hence d̂ = 0.1 and σ_d̂ ≈ sqrt(0.3·0.7/100 + 0.2·0.8/100) ≈ 0.061  Pr(d > 0) is equivalent to a one-sided interval  The observed d̂ = 0.1 ≈ 1.64σ_d̂, and 1.64σ corresponds to a two-sided interval with confidence level 90%, i.e. a one-sided interval with confidence level 95%.
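The slide's numbers can be reproduced with a short calculation, a sketch under the normal approximation (statistics.NormalDist supplies the standard normal CDF):

```python
from math import sqrt
from statistics import NormalDist

# Slide's example: samples of size 100, error_S1(h1) = 0.3, error_S2(h2) = 0.2
e1, e2, n1, n2 = 0.30, 0.20, 100, 100
d_hat = e1 - e2                                        # observed difference 0.10
sigma = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)  # ≈ 0.061

# One-sided: Pr(d > 0) ≈ Phi(d_hat / sigma) = Phi(1.64) ≈ 0.95
p_d_positive = NormalDist().cdf(d_hat / sigma)
```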

29 Agenda  Estimating hypothesis accuracy  Basics of sampling theory  Deriving confidence intervals (general approach)  Difference between hypotheses  Comparing learning algorithms

30 Ch5 Evaluating Hypotheses 1. Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples? (hypothesis accuracy) 2. Given that hypothesis h outperforms h' over some sample, how probable is it that h outperforms h' in general? (difference between hypotheses) 3. When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

31 Qs in Focus Let L_A and L_B be two learning algorithms  What is an appropriate test for comparing L_A and L_B?  How can we determine whether an observed difference is statistically significant?

32 Statement of Problem We want to estimate E_{S⊂D}[error_D(L_A(S)) - error_D(L_B(S))], where L(S) is the hypothesis output by learner L using training set S. Remark: the difference of errors is averaged over all training sets of size n randomly drawn from D. In practice, given limited data D_0, what is a good estimator?  Partition D_0 into training set S_0 and test set T_0, and measure error_T0(L_A(S_0)) - error_T0(L_B(S_0))  Even better: repeat the above many times and average the results

33 Procedure 1. Partition D_0 into k disjoint subsets T_1, T_2, …, T_k of equal size, each of size at least 30. 2. For i from 1 to k, do: use T_i for testing; S_i ← D_0 - T_i; h_A ← L_A(S_i); h_B ← L_B(S_i); δ_i ← error_Ti(h_A) - error_Ti(h_B) 3. Return the average of the δ_i as the estimate
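The procedure can be sketched as a small Python function. The learner and error-function interfaces and the shuffling are illustrative assumptions; the slide only fixes the partition / train / test / difference structure.

```python
import random

def paired_kfold_diff(learn_A, learn_B, error, D0, k=3, seed=0):
    """Partition D0 into k disjoint test sets, train each learner on the
    remainder, and return the average of the per-fold error differences.
    learn_*(S) -> hypothesis; error(h, T) -> error rate of h on T."""
    data = list(D0)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    deltas = []
    for i in range(k):
        T_i = folds[i]
        S_i = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_A, h_B = learn_A(S_i), learn_B(S_i)
        deltas.append(error(h_A, T_i) - error(h_B, T_i))
    return sum(deltas) / k
```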

34 Estimator  The estimator is δ̄ ≡ (1/k) Σ_{i=1..k} δ_i  The approximate N% confidence interval for estimating d using δ̄ is δ̄ ± t_{N,k-1} s_δ̄, where t_{N,k-1} is the two-sided t-distribution constant for k-1 degrees of freedom and s_δ̄ = sqrt[(1/(k(k-1))) Σ_{i=1..k} (δ_i - δ̄)²]
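A minimal sketch of this interval; the t constant must be looked up for k-1 degrees of freedom, and the five δ_i values and t_{95,4} ≈ 2.776 below are assumptions for illustration:

```python
from math import sqrt
from statistics import mean

def paired_t_interval(deltas, t_value):
    """delta_bar ± t_{N,k-1} * s_delta_bar for k paired differences."""
    k = len(deltas)
    d_bar = mean(deltas)
    s = sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return d_bar - t_value * s, d_bar + t_value * s

# k = 5 hypothetical per-fold differences; t for 95%, 4 d.o.f. ≈ 2.776
lo, hi = paired_t_interval([0.02, 0.05, 0.01, 0.04, 0.03], 2.776)
```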

35 Paired t Tests  To understand the justification for the confidence interval given on the previous slide, consider the following estimation problem: We are given observed values of a set of i.i.d. random variables Y_1, Y_2, …, Y_k. We wish to estimate the expected value of these Y_i. We use the sample mean as the estimator

36 Problem with Limited Data D_0  δ_1 … δ_k are not i.i.d., because they are based on overlapping sets of training examples drawn from D_0 rather than the full distribution D.  Instead, view the procedure on slide 33 as producing an estimate of E_{S⊂D0}[error_D(L_A(S)) - error_D(L_B(S))], i.e. the expected difference restricted to training sets drawn from D_0.

37 HW  5.4 (10pt, Due Monday, 10-24)  5.6 (10pt, Due Monday, 10-24)

