Probability theory: (lecture 2 on AMLbook.com)

Probability theory (lecture 2 on AMLbook.com)
Maximum likelihood estimation (MLE): get parameters from data
Hoeffding's inequality (HI): how good are my parameter estimates?
Connection to learning: HI applies to the relationship between Eout(h), Etest(h), and Ein(h)

Maximum Likelihood Parameter Estimation
Estimate the parameters θ of a probability distribution given a sample X drawn from that distribution.
Example: estimate the mean and variance of a normal distribution.

Form the likelihood function
Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x_t|θ)
Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x_t|θ)
Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X), the value of θ that maximizes L(θ|X)

Example: Bernoulli distribution
x ∈ {0, 1}: x = 0 implies failure, x = 1 implies success
p_o = probability of success: the parameter to be determined from the data
p(x) = p_o^{x} (1 − p_o)^{1−x}, so p(1) = p_o, p(0) = 1 − p_o, and p(1) + p(0) = 1: the distribution is normalized.
Given a sample of N trials, show that p_o = ∑_t x_t / N (the fraction of successes) is the maximum likelihood estimate of p_o.

Since the distribution is normalized, MLE can be applied without constraints.
Log-likelihood function: L(p_o|X) = log( ∏_t p_o^{x_t} (1 − p_o)^{1−x_t} )
Solve dL/dp_o = 0 for p_o.
First step: simplify the log-likelihood function.

Simplify the log-likelihood function
L(p_o|X) = log( ∏_t p_o^{x_t} (1 − p_o)^{1−x_t} )
L(p_o|X) = ∑_t log( p_o^{x_t} (1 − p_o)^{1−x_t} )
L(p_o|X) = ∑_t { log(p_o^{x_t}) + log((1 − p_o)^{1−x_t}) }
L(p_o|X) = ∑_t { x_t log(p_o) + (1 − x_t) log(1 − p_o) }
L(p_o|X) = log(p_o) ∑_t x_t + log(1 − p_o) ∑_t (1 − x_t)

Take the derivative, set it to zero, and solve for p_o
dL/dp_o = (1/p_o) ∑_t x_t − (1/(1 − p_o)) ∑_t (1 − x_t) = 0
(1/p_o) ∑_t x_t = (1/(1 − p_o)) ∑_t (1 − x_t)
((1 − p_o)/p_o) ∑_t x_t = ∑_t (1 − x_t) = N − ∑_t x_t
∑_t x_t = p_o N
p_o = ∑_t x_t / N
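The result can be checked numerically. Below is a minimal Python sketch (not from the slides; the seed, p_true = 0.3, and the grid are illustrative choices) that draws N Bernoulli trials and confirms that the fraction of successes coincides with the grid maximizer of L(p_o|X) = log(p_o) ∑_t x_t + log(1 − p_o) ∑_t (1 − x_t):

```python
import numpy as np

rng = np.random.default_rng(0)           # assumed seed, for reproducibility
p_true = 0.3                             # hypothetical "true" success probability
N = 1000
x = rng.binomial(1, p_true, size=N)      # N Bernoulli trials (0 = failure, 1 = success)

# Closed-form MLE derived above: the fraction of successes
p_mle = x.sum() / N

# Numerical check: evaluate the simplified log-likelihood on a grid of candidate p_o
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = x.sum() * np.log(p_grid) + (N - x.sum()) * np.log(1.0 - p_grid)
p_grid_best = p_grid[np.argmax(log_lik)]

print(f"closed-form MLE: {p_mle:.3f}   grid maximizer: {p_grid_best:.3f}")
```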

Similarly for the Gaussian distribution p(x) = N(μ, σ²), a function of a single random variable with a shape characterized by 2 parameters. The MLEs for μ and σ² are
μ̂ = (1/N) ∑_t x_t
σ̂² = (1/N) ∑_t (x_t − μ̂)²
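A similar sketch for the Gaussian case (the loc, scale, and sample size are illustrative, not from the slides): the MLEs are the sample mean and the variance computed with a 1/N factor, which numpy's default (ddof=0) variance reproduces:

```python
import numpy as np

rng = np.random.default_rng(0)                   # assumed seed
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # hypothetical sample from N(5, 4)

mu_hat = x.mean()                                # MLE of mu: the sample mean
var_hat = np.var(x)                              # MLE of sigma^2: (1/N) sum (x_t - mu_hat)^2

print(f"mu_hat = {mu_hat:.3f}   var_hat = {var_hat:.3f}")
```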

Pictorial representation of estimating a population mean by a sample mean

Hoeffding's inequality
Confidence in an estimate of the population mean depends on the sample size.
Let δ = 2 exp(−2ε²N) and solve for ε(δ,N) = sqrt(ln(2/δ)/(2N)).
Hoeffding's inequality says that |μ − ν| < ε(δ,N) with probability at least 1 − δ, where μ is the population mean and ν is the sample mean.

Typical application of Hoeffding's inequality
Suppose N = 100 and we want 95% confidence; then δ = 0.05 and ε = sqrt(ln(2/0.05)/200) ≈ 0.14.
We have 95% confidence that |ν − μ| < 0.14.
Assignment 3, due 9-30-14: make a table of the bounds that can be placed on the relative error in estimates of a population mean μ = 1 based on a sample with N = 100, at confidence levels 90%, 95%, and 99%.
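As a sanity check of the numbers above, here is a short Python sketch (the helper name hoeffding_eps is mine, not from the slides) that evaluates ε(δ,N) = sqrt(ln(2/δ)/(2N)) for N = 100 at several confidence levels:

```python
import math

def hoeffding_eps(delta, N):
    """Hoeffding bound epsilon(delta, N) = sqrt(ln(2/delta) / (2N))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * N))

N = 100
for conf in (0.90, 0.95, 0.99):
    delta = 1.0 - conf
    print(f"confidence {conf:.0%}:  |nu - mu| < {hoeffding_eps(delta, N):.3f}")
```

For the 95% row this reproduces the 0.14 quoted above.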

For a specific hypothesis h, Eout(h) is analogous to the population mean. Ein(h) is analogous to a sample mean

For a specific hypothesis h, Hoeffding's inequality applies only if the sample on which the error is evaluated was not used in the selection of h. Usually, h is the optimum member of a hypothesis set chosen using the training data, and its error is then evaluated on a separate test set, giving Etest(h).

As with any estimate of a population mean: choose the test-set size N and confidence level 1 − δ; then |Etest − Eout| < ε(δ,N) = sqrt(ln(2/δ)/(2N)). Since we expect Eout > Etest, this can be written Eout < Etest + ε(δ,N).
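The bound can also be inverted: for a desired tolerance ε and confidence 1 − δ, sqrt(ln(2/δ)/(2N)) ≤ ε requires N ≥ ln(2/δ)/(2ε²). A minimal sketch (required_N is an illustrative helper and the values are hypothetical):

```python
import math

def required_N(delta, eps):
    """Smallest N with sqrt(ln(2/delta)/(2N)) <= eps, i.e. N >= ln(2/delta) / (2*eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. 95% confidence (delta = 0.05) that |Etest - Eout| < 0.1
print(required_N(delta=0.05, eps=0.10))   # -> 185
```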

Test set dilemma
To obtain meaningful confidence from Hoeffding's inequality may require a test-set size N so large that training is compromised. In Chapter 4 of the text, the authors discuss a method to calculate Etest(h) that does not greatly diminish the data available for training.

Review: Probability theory (see lecture 2 on AMLbook.com)
Maximum likelihood estimation (MLE): get parameters from data
Hoeffding's inequality (HI): how good are my parameter estimates?
Connection to learning: HI applies to the relationship between Eout(h) and Etest(h)

Expansion of Assignment 3, due 9-30-14
1) Make a table of the bounds that can be placed on the relative error in estimates of a population mean μ = 1 based on a sample with N = 100, at confidence levels 90%, 95%, and 99%.
2) |Etest − Eout| < ε(δ,N) = sqrt(ln(2/δ)/(2N)). The sponsor requires 98% confidence with ε(δ,N) = 0.1. How large does N have to be to achieve this?

Review: For a specific hypothesis h, Eout(h) is analogous to the population mean. Ein(h) is analogous to a sample mean

Hoeffding's inequality applies to the relationship between Eout(h) and Ein(h). In machine learning we deal with multiple hypotheses. Each h can have a different distribution of correct and incorrect examples in the population, and a different in-sample error.

Each test of a hypothesis is an independent event. The hypothesis with the smallest Ein(h) is selected as the optimum hypothesis g.

Modifying Hoeffding's inequality for a finite hypothesis set
Because g is selected after looking at the data, we do not know in advance which member of the hypothesis set it will be; whichever h becomes g, the event |Ein(g) − Eout(g)| > ε is one of the events |Ein(h) − Eout(h)| > ε.

"Union bound" approximation
Since the event |Ein(g) − Eout(g)| > ε is contained in the union of the events |Ein(h) − Eout(h)| > ε over the members of the hypothesis set, its probability is certainly no larger than the sum of their probabilities:
P[|Ein(g) − Eout(g)| > ε] ≤ ∑_h P[|Ein(h) − Eout(h)| > ε]

Apply Hoeffding's inequality to each term in the sum that is the "union bound": for a hypothesis set with M members, P[|Ein(g) − Eout(g)| > ε] ≤ 2M exp(−2ε²N). This simple result has no practical value because (1) it is a very conservative bound and (2) most hypothesis sets used in machine learning are not finite.

Feasibility of learning
The union-bound approximation shows the "feasibility of learning" for finite hypothesis sets: for any (ε, δ), P[|Ein(g) − Eout(g)| > ε] < δ can be achieved with sufficiently large N.
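With the union bound, 2M exp(−2ε²N) ≤ δ gives N ≥ ln(2M/δ)/(2ε²), so the required sample size grows only logarithmically with the size M of the hypothesis set. A short sketch (required_N_finite and the chosen values are illustrative, not from the slides):

```python
import math

def required_N_finite(M, delta, eps):
    """Smallest N with 2*M*exp(-2*eps**2*N) <= delta, i.e. N >= ln(2M/delta) / (2*eps**2)."""
    return math.ceil(math.log(2.0 * M / delta) / (2.0 * eps ** 2))

# The required N grows slowly (logarithmically) as the hypothesis set gets larger
for M in (1, 10, 1000, 10**6):
    print(M, required_N_finite(M, delta=0.05, eps=0.10))
```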

Feasibility of learning
For some infinite hypothesis sets, depending on their VC dimension, M can be replaced by a polynomial in N. This still shows that learning is feasible, because the exponential factor exp(−2ε²N) eventually dominates any polynomial in N.