Probability theory: (lecture 2 on AMLbook.com)


1 Probability theory: (lecture 2 on AMLbook.com)
maximum likelihood estimation (MLE): get parameters from data
Hoeffding's inequality (HI): how good are my parameter estimates?
Connection to learning: HI applies to the relationship between Eout(h), Etest(h), and Ein(h)

2 Maximum Likelihood Parameter Estimation
Estimate the parameters θ of a probability distribution given a sample X drawn from that distribution. Example: estimate the mean and variance of a normal distribution.

3 Form the likelihood function
Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x_t|θ)
Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x_t|θ)
Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X), the value of θ that maximizes L(θ|X)

4 Example: Bernoulli distribution
x ∈ {0,1}: x = 0 implies failure, x = 1 implies success
p_0 = probability of success: the parameter to be determined from data
p(x) = p_0^x (1 − p_0)^(1−x), so p(1) = p_0, p(0) = 1 − p_0, and p(1) + p(0) = 1: the distribution is normalized
Given a sample of N trials, show that p_0 = (∑_t x_t)/N = (number of successes)/N is the maximum likelihood estimate of p_0

5 Since distribution is normalized,
MLE can be applied without constraints.
Log-likelihood function: L(p_0|X) = log( ∏_t p_0^{x_t} (1 − p_0)^{1−x_t} )
Solve dL/dp_0 = 0 for p_0
First step: simplify the log-likelihood function

6 Simplify the log-likelihood function
L(p_0|X) = log( ∏_t p_0^{x_t} (1 − p_0)^{1−x_t} )
= ∑_t log( p_0^{x_t} (1 − p_0)^{1−x_t} )
= ∑_t { log(p_0^{x_t}) + log( (1 − p_0)^{1−x_t} ) }
= ∑_t { x_t log(p_0) + (1 − x_t) log(1 − p_0) }
= log(p_0) ∑_t x_t + log(1 − p_0) ∑_t (1 − x_t)

7 Take the derivative, set to zero, solve for p0
L/p0 = 1/po St xt- 1/(1 – po )St(1 - xt ) = 0 1/po St xt= 1/(1 – po )St(1 - xt ) ((1 – po )/ po)St xt = St (1 - xt ) = N - St xt St xt = po N St xt / N = po L/p0 = St xt/po - (1 - xt )/(1 – po ) = 0

8 Similarly for Gaussian distribution
p(x) = N(μ, σ²): a function of a single random variable with a shape characterized by 2 parameters
MLE for μ and σ²: μ_MLE = (1/N) ∑_t x_t and σ²_MLE = (1/N) ∑_t (x_t − μ_MLE)²
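A similarly minimal Python sketch of the Gaussian MLE (the data values are hypothetical), using the sample mean and the biased, divide-by-N sample variance:

import numpy as np

def gaussian_mle(x):
    # MLE for N(mu, sigma^2): sample mean and biased (divide-by-N) sample variance.
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return mu, var

mu_hat, var_hat = gaussian_mle([2.1, 1.8, 2.4, 2.0, 1.7])
print(mu_hat, var_hat)                             # 2.0 and 0.06 for this sample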

9 Pictorial representation of estimating a population mean
by a sample mean

10 Hoeffding’s inequality
Confidence in the estimate of the population mean depends on sample size
Let δ = 2 exp(−2ε²N) and solve for ε(δ,N) = sqrt( ln(2/δ) / (2N) )
H.I. says that |n − m| < ε(δ,N) with probability at least 1 − δ (m = population mean, n = sample mean)
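A one-line Python sketch of this tolerance (the function name epsilon is my own); evaluating it at δ = 0.05 and N = 100 reproduces the ε ≈ 0.14 used on the next slide:

import math

def epsilon(delta, N):
    # Hoeffding tolerance: |n - m| < epsilon(delta, N) with probability >= 1 - delta.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * N))

print(epsilon(0.05, 100))                          # ~0.136, i.e. about 0.14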

11 Typical application of Hoeffding’s inequality
Suppose N = 100 and we want 95% confidence; then δ = 0.05 and ε ≈ 0.14
We have 95% confidence that |n − m| < 0.14
Assignment 3, due 9-30-14: Make a table of the bounds that can be placed on the relative error in estimates of a population mean m = 1 based on a sample with N = 100, at confidence levels 90%, 95%, and 99%

12 For a specific hypothesis h, Eout(h) is analogous to
the population mean. Ein(h) is analogous to a sample mean

13 For a specific hypothesis h, Hoeffding’s inequality
applies if the sample on which Ein(h) is evaluated was not used in the selection of h. Usually, h is the optimum member of a hypothesis set chosen using the training data, and Ein(h) is then evaluated on a test set.

14 As with any estimate of population mean
Choose a test-set size N and confidence level 1 − δ; then |Etest − Eout| < ε(δ,N) = sqrt( ln(2/δ) / (2N) )
Since we expect Eout > Etest, this can be written Eout < Etest + ε(δ,N)
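Inverting the same expression gives the test-set size needed for a desired tolerance; a minimal sketch under the same bound (required_N is my own helper, not from the slides):

import math

def required_N(delta, eps):
    # Smallest test-set size N such that epsilon(delta, N) <= eps.
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(required_N(0.05, 0.1))                       # 185 test examples for 95% confidence at eps = 0.1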

15 Test set dilemma
Obtaining meaningful confidence from Hoeffding's inequality may require Ntest so large that training is compromised. In chapter 4 of the text, the authors discuss a method to calculate Etest(h) that does not greatly diminish the data available for training.

16 Review: Probability theory
(see lecture 2 on AMLbook.com)
maximum likelihood estimation (MLE): get parameters from data
Hoeffding's inequality (HI): how good are my parameter estimates?
Connection to learning: HI applies to the relationship between Eout(h) and Etest(h)

17 Expansion of Assignment 3, due 9-30-14
1) Make a table of the bounds that can be placed on the relative error in estimates of a population mean m = 1 based on a sample with N = 100, at confidence levels 90%, 95%, and 99%
2) |Etest − Eout| < ε(δ,N) = sqrt( ln(2/δ) / (2N) ). The sponsor requires 98% confidence that ε(δ,N) = 0.1. How large does N have to be to achieve this?

18 Review: For a specific hypothesis h, Eout(h) is
analogous to the population mean. Ein(h) is analogous to a sample mean

19 HI applies to relationship between Eout(h) and Ein(h)
In machine learning we deal with multiple hypotheses. Each h can have a different distribution of correct and incorrect examples in the population and a different in-sample error

20 Each test of an hypothesis is an independent event
One of these events has the smallest Ein(h) and determines the optimum hypothesis g

21 Modifying Hoeffding inequality for finite hypothesis set
Since g is the optimum hypothesis, P[|Ein(g) − Eout(g)| > ε] is likely to be smaller than P[|Ein(h) − Eout(h)| > ε] for most members of the hypothesis set

22 “union bound” approximation
If P[|Ein(g) − Eout(g)| > ε] is smaller than P[|Ein(h) − Eout(h)| > ε] for most members of the hypothesis set, it is certainly smaller than their sum.

23 Apply Hoeffding inequality to each term in the sum
that is the "union bound". This simple result has no practical value because (1) it is a very conservative bound and (2) most hypothesis sets used in machine learning are not finite.
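Written out in the notation of these slides, with M denoting the number of hypotheses in the finite set, the chain of inequalities sketched on slides 21-23 is the standard union-bound form of Hoeffding's inequality:
P[|Ein(g) − Eout(g)| > ε] ≤ ∑_{m=1..M} P[|Ein(h_m) − Eout(h_m)| > ε] ≤ 2M exp(−2ε²N)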

24 Feasibility of learning
The union bound approximation shows the "feasibility of learning" for finite hypothesis sets. For any (ε,δ), P[|Ein(g) − Eout(g)| > ε] < δ can be achieved by taking N sufficiently large.
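A minimal Python sketch of how large N must be under the union bound (the helper name and the example values of δ, ε, and M are my own, for illustration only):

import math

def required_N_finite_M(delta, eps, M):
    # Smallest N such that 2 * M * exp(-2 * eps**2 * N) <= delta (union bound).
    return math.ceil(math.log(2.0 * M / delta) / (2.0 * eps ** 2))

print(required_N_finite_M(0.05, 0.1, M=1))         # 185: single hypothesis, plain Hoeffding
print(required_N_finite_M(0.05, 0.1, M=1000))      # 530: larger M demands larger N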

25 Feasibility of learning
For some infinite hypothesis sets, depending on their VC dimension, M can be replaced by a polynomial in N. This shows learning is feasible because the polynomial growth is eventually dominated by the exponentially decaying factor in the bound.
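A small numerical illustration of that dominance (the tolerance ε and the polynomial degree are hypothetical choices): the product of a polynomial in N and the Hoeffding exponential eventually goes to zero as N grows.

import math

eps, deg = 0.1, 10                                 # hypothetical tolerance and polynomial degree
for N in (10**3, 10**4, 10**5):
    print(N, N**deg * math.exp(-2 * eps**2 * N))   # large at first, then collapses toward 0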


