Parametric Estimation

Parametric Estimation Given X = {x^t}, where x^t is drawn from a probability distribution p(x). Properties of p(x) can be discovered by parametric and non-parametric techniques. In parametric estimation, assume a form p(x|θ) (e.g. Gaussian) and estimate the parameters θ (e.g. mean and variance). Maximum likelihood estimation: find the θ that maximizes the likelihood of drawing the sample X = {x^t}. Assume the x^t are independent and identically distributed (iid). (Slides follow Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press.)

Maximum Likelihood Estimation Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏t p(x^t|θ). Log likelihood: L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ). Maximum likelihood estimator (MLE): θ* = argmaxθ L(θ|X), the value of θ that maximizes L(θ|X).

Example: Bernoulli distribution Two states, failure and success; x ∈ {0, 1}. P(x) = p_o^x (1 − p_o)^(1 − x), where p_o is the probability of success. L(p_o|X) = log ∏t p_o^(x^t) (1 − p_o)^(1 − x^t). Show that p_o = ∑t x^t / N = (number of successes)/(number of trials).

Normalization is built into this distribution: if x = 1, P(x) = p_o; if x = 0, P(x) = 1 − p_o. So MLE can be applied without constraints: solve dL/dp_o = 0 for p_o. First step: simplify the log-likelihood function.

L(p_o|X) = log ∏t p_o^(x^t) (1 − p_o)^(1 − x^t) = ∑t log( p_o^(x^t) (1 − p_o)^(1 − x^t) )

= ∑t [ log(p_o^(x^t)) + log((1 − p_o)^(1 − x^t)) ]

= ∑t [ x^t log(p_o) + (1 − x^t) log(1 − p_o) ]

Setting dL/dp_o = ∑t x^t/p_o − ∑t (1 − x^t)/(1 − p_o) = 0 gives

∑t x^t/p_o = ∑t (1 − x^t)/(1 − p_o)

((1 − p_o)/p_o) ∑t x^t = ∑t (1 − x^t) = N − ∑t x^t, so ∑t x^t = p_o N and p_o = ∑t x^t / N
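A quick numerical check of this result; a minimal sketch (the success probability and sample size are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3
x = rng.binomial(1, p_true, size=1000)   # iid Bernoulli draws x^t in {0, 1}

p_mle = x.sum() / x.size                 # p_o = sum_t x^t / N
print(p_mle)                             # close to 0.3
```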

Generalize to K > 2 states: Multinomial Density The outcome of a random event is one of K mutually exclusive and exhaustive states. p_i = probability that the outcome is state i. x is a Boolean vector drawn from the distribution; the probability to draw x is P(x_1, x_2, ..., x_K) = ∏i p_i^(x_i). Only one term in the product is different from unity. Log likelihood of drawing the training set X: L(p_1, p_2, ..., p_K|X) = log ∏t ∏i p_i^(x_i^t). Show that the MLE is p_i = ∑t x_i^t / N, the fraction of trials in which state i is the outcome.

In the multinomial case, MLE must be applied subject to the normalization constraint ∑i p_i = 1. Lagrange multipliers are a common approach to constrained optimization. Example Lagrangian: L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1). Set the partial derivatives of L with respect to x1, x2, and λ to zero: −2x1 + λ = 0, −2x2 + λ = 0, x1 + x2 − 1 = 0. It is not necessary to find λ (why?). λ is also called the "undetermined multiplier".

A simple example of constrained optimization using Lagrange multipliers: find the stationary point of f(x1, x2) = 1 − x1² − x2² subject to the constraint g(x1, x2) = x1 + x2 − 1 = 0.

Form the Lagrangian L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1)

Set the partial derivatives of L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1) with respect to x1, x2, and λ equal to zero: −2x1 + λ = 0, −2x2 + λ = 0, x1 + x2 − 1 = 0. Solve for x1 and x2.

x1* = x2* = ½. In this case it is not necessary to find λ. λ is sometimes called the "undetermined multiplier".
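A minimal symbolic check of this example, assuming SymPy is available (variable names are illustrative):
```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
L = 1 - x1**2 - x2**2 + lam*(x1 + x2 - 1)   # the Lagrangian

# Stationary point: set all partial derivatives to zero and solve
sol = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], (x1, x2, lam), dict=True)
print(sol)   # [{x1: 1/2, x2: 1/2, lam: 1}]
```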

Generalize to K > 2 states: Multinomial Density The outcome of a random event is one of K mutually exclusive and exhaustive states. p_i = probability that the outcome is state i. Log likelihood of drawing a training set X of size N: L(p_1, p_2, ..., p_K|X) = log ∏t ∏i p_i^(x_i^t). Show that the MLE is p_i = ∑t x_i^t / N. First step: simplify the log-likelihood function.

MLE of the multinomial distribution by Lagrange multipliers Probability to draw x: P(x_1, x_2, ..., x_K) = ∏i p_i^(x_i). L(p_1, p_2, ..., p_K|X) = log ∏t ∏i p_i^(x_i^t) = ∑t ∑i x_i^t log(p_i)

MLE by Lagrange multipliers, continued. Form the Lagrangian ∑t ∑i x_i^t log(p_i) + λ(∑i p_i − 1) and set ∂/∂p_k = ∑t x_k^t / p_k + λ = 0, so p_k = −∑t x_k^t / λ. The constraint ∑k p_k = 1 gives λ = −N (since x_k^t is Boolean, ∑k ∑t x_k^t = N), hence p_k = ∑t x_k^t / N.
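A small numerical sketch of this result (the state probabilities and sample size below are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 1000
true_p = np.array([0.2, 0.5, 0.3])

# Draw N Boolean indicator vectors x^t from the multinomial distribution
states = rng.choice(K, size=N, p=true_p)
X = np.eye(K)[states]            # shape (N, K); each row has a single 1

p_mle = X.sum(axis=0) / N        # p_i = sum_t x_i^t / N
print(p_mle)                     # close to [0.2, 0.5, 0.3]
```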

1D Gaussian Distribution p(x) = N(μ, σ²): a function of a single random variable with a shape characterized by 2 parameters. MLE for μ and σ²: m = ∑t x^t / N and s² = ∑t (x^t − m)² / N.

Pseudo-code for sampling a Gaussian distribution with specified mean and variance: z is normally distributed with zero mean and unit variance. Find a library function for random numbers drawn from p(z). Given a random number z_i from this distribution, x_i = σ z_i + μ is a random number with the desired characteristics.
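A minimal NumPy sketch of this pseudo-code (the mean and standard deviation below are illustrative), with the Gaussian MLEs recovered from the sample:
```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.5                  # desired mean and standard deviation

z = rng.standard_normal(10_000)       # z ~ N(0, 1)
x = sigma * z + mu                    # x ~ N(mu, sigma^2)

m = x.sum() / x.size                  # MLE of the mean
s2 = ((x - m) ** 2).sum() / x.size    # MLE of the variance (divides by N, not N-1)
print(m, s2)                          # approximately 3.0 and 2.25
```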

Gaussian Parametric Classification Define a discriminant function using Bayes' rule, P(C_i|x) = p(x|C_i) P(C_i) / p(x) (posterior = likelihood × prior / evidence), with class likelihoods that are Gaussian distributed. First step: take the log of P(C_i|x).

We can drop the term log(p(x)) Why? Next step?

Substitute the log of Gaussian class likelihood

Given a 1D multi-class dataset and the discriminant function g_i(x), how do we use this discriminant to classify an object with attribute x?

Given the value of attribute x, calculate g_i(x) for all classes. Assign the object to the class with the largest g_i(x). Before this procedure can be followed, we must have estimators for the mean, variance, and prior of each class.

Estimate the prior, mean, and variance of all classes: x^t is a scalar, r^t is a Boolean vector with r_i^t = 1 if x^t belongs to class i. Use r_i^t to pick out class i examples in sums over the whole dataset: P̂(C_i) = ∑t r_i^t / N (the MLE of the prior is the fraction of examples in class i), m_i = ∑t r_i^t x^t / ∑t r_i^t, and s_i² = ∑t r_i^t (x^t − m_i)² / ∑t r_i^t. m_i and s_i² are class-specific estimators.

Use MLE results to construct class discriminants
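As an illustration, a hedged sketch of the whole procedure (function and variable names are my own; r is the N×K Boolean label matrix described above):
```python
import numpy as np

def fit_gaussian_classes(x, r):
    """Per-class MLE: prior, mean, variance (x: (N,) scalars, r: (N, K) Boolean)."""
    N, K = r.shape
    priors = r.sum(axis=0) / N
    means = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    variances = (r * (x[:, None] - means) ** 2).sum(axis=0) / r.sum(axis=0)
    return priors, means, variances

def discriminant(x, priors, means, variances):
    """g_i(x) = -0.5*log(var_i) - (x - m_i)^2 / (2 var_i) + log(prior_i), constants dropped."""
    x = np.atleast_1d(x)[:, None]
    return (-0.5 * np.log(variances)
            - (x - means) ** 2 / (2 * variances)
            + np.log(priors))

# Assign each new x to the class with the largest discriminant:
# labels = np.argmax(discriminant(x_new, *fit_gaussian_classes(x_train, r_train)), axis=1)
```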

Example: a 1D 2-class problem where the class likelihoods have means ±2 and equal variances, and the priors are also equal. Between −2 and +2 the posteriors transition, as a function of x, between essentially certain classification of one class and the other. There is a single boundary halfway between the means, where the normalized posteriors both equal 0.5; at the boundary the most probable class changes.

When the variances are different, the red class likelihood is also dominant for x below about −7, giving more than one boundary. Boundaries where the dominant posterior changes are called "Bayes discriminant points".

Assignment 2: Due 9/20/12 Use the equality of discriminants to derive a quadratic equation for Bayes' discriminant points in a 1D, 2-class problem with Gaussian class likelihoods. The mean and variance of C1 are 3 and 1, respectively; the mean and variance of C2 are 2 and 0.3, respectively; the priors are equal. With a sample size of 100, compare the MLE estimators to the true means and variances. For the same sample, compare Bayes' discriminant points calculated from the MLE estimators with those derived from the true means and variances.

For a 1D, 2-class problem with Gaussian class likelihoods, derive the functional form of P(C1|x) when the following are true: (1) variances and priors are equal, (2) posteriors are normalized. Hint: start with the ratio of posteriors to eliminate the priors and the evidence.

With equal priors P(C1|x)/P(C2|x) = p(x|C1)/p(x|C2) = f(x) How do we derive f(x)?

f(x) = p(x|C1)/p(x|C2) = N(m1, s1²)/N(m2, s2²). With s1 = s2 = s, f(x) = exp(−(x − m1)²/2s²) / exp(−(x − m2)²/2s²). How do we simplify this expression?

Combine the exponents and simplify: f(x) = exp((m1 − m2)x/s² + (m2² − m1²)/2s²) = exp(wx + w0). Why did the quadratic term cancel? P(C1|x)/P(C2|x) = p(x|C1)/p(x|C2) = f(x). Given f(x), how do we get P(C1|x)?

Use normalization: P(C2|x) = 1 − P(C1|x), so P(C1|x)/(1 − P(C1|x)) = f(x) = exp(y) with y = wx + w0. Solving for P(C1|x) gives P(C1|x) = sigmoid(y). The domain of class 1 is where P(C1|x) > 0.5.
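A short sketch of this result (the means, variance, and function name are illustrative): the posterior is a sigmoid of a linear function of x.
```python
import numpy as np

def posterior_c1(x, m1, m2, s):
    w = (m1 - m2) / s**2                    # slope from combining the exponents
    w0 = (m2**2 - m1**2) / (2 * s**2)       # intercept
    y = w * x + w0
    return 1.0 / (1.0 + np.exp(-y))         # sigmoid(y)

x = np.linspace(-6, 6, 7)
print(posterior_c1(x, m1=2.0, m2=-2.0, s=1.0))   # > 0.5 exactly where class 1 dominates (x > 0)
```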

The most common model of the output from a neuron with weighted connections to its input is y = wx + w0 = w^T x, followed by s = sigmoid(y) (figure: perceptron with input x, a bias node fixed at 1, and output s). Training this perceptron dichotomizer optimizes the weights so that s switches from values near 1 to values near 0 when the attributes x change from those of class members to those of non-class members.

Contrast between parametric and non-parametric methods. Parametric: use the discriminant function to assign the class. Evaluate g_i(x) given estimators of the mean and variance from MLE. All of this is based on the assumption of Gaussian class likelihoods.

Contrast between parametric and non-parametric methods. Non-parametric: use the same discriminant form, y = wx + w0 = w^T x with s = sigmoid(y), but with the parameters w and w0 determined from the data. Some optimization procedure must replace MLE; for ANNs we most often use back propagation.

Regression The only restriction on data for regression is that the label (also called the output) be a real number. The input vector can have components that are mixtures of real numbers and Booleans. The input should contain all of the attributes (also called predictors) that influence the value of the label. To develop the theory of parametric regression, we treat the case of scalar input (i.e., one real number).

Regression: scalar input The output (label) is assumed to be an unknown function of the input plus noise ε that is normally distributed with zero mean and variance σ²: r = f(x) + ε. Since the distribution of ε has a mean of zero, p(r|x) = N(f(x), σ²). We seek an estimator g(x|θ) of the mean f(x), with parameters θ determined by MLE.

Maximum likelihood estimation of g(x|θ): the log likelihood is L(θ|X) = log ∏t p(r^t|x^t), with p(r^t|x^t) = N(g(x^t|θ), σ²). How do we simplify this log-likelihood function?

Product over t becomes sum over t Next?

2 terms in the sum How do we evaluate this sum?

How do we choose θ to maximize this log-likelihood function?

Maximize the likelihood by minimizing the sum of squared residuals, E(θ|X) = ∑t (r^t − g(x^t|θ))². This optimization criterion is a consequence of the assumption of Gaussian-distributed noise.

Regression by linear least squares Assume g(x|θ) is a linear combination of n known functions (1, x, x², ..., x^(n−1), for example). Let m be the number of examples (x^t, r^t) in the training set. Define the m×n matrix A with A_tj = the jth function evaluated at x^t; θ is the column vector of n unknown coefficients; b is the column vector of the m r^t in the training set. If Aθ = b had a solution, then g(x^t|θ) = r^t for all t. Not what we want, why? With n << m, Aθ = b has no exact solution.

Normal Equations Look for an approximate solution which minimizes a norm of the residual vector r = b − Aθ. Choosing the Euclidean norm, we minimize the sum of squared residuals: define f(θ) = ||r||² = r^T r, so f(θ) = (b − Aθ)^T (b − Aθ) = b^T b − 2θ^T A^T b + θ^T A^T A θ. A necessary condition for θ₀ to be a minimum of f(θ) is ∇f(θ₀) = 0. Since ∇f(θ) = 2A^T Aθ − 2A^T b, the optimal set of parameters is a solution of the n×n symmetric system of linear equations A^T Aθ = A^T b.
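A hedged sketch (the data and basis functions are illustrative) solving the normal equations directly; numpy.linalg.lstsq solves the same problem more stably.
```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
r = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)   # noisy linear data

A = np.column_stack([np.ones_like(x), x])     # basis functions 1 and x
theta = np.linalg.solve(A.T @ A, A.T @ r)     # normal equations A^T A theta = A^T b
print(theta)                                  # approximately [2.0, 3.0]

theta_lstsq, *_ = np.linalg.lstsq(A, r, rcond=None)   # equivalent, numerically more stable
```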

Polynomial Regression Use the Vandermonde matrix D for a polynomial of degree k fit to N data points: solve D^T D θ = D^T r for the k+1 coefficients.
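A minimal sketch of a degree-k polynomial fit via the Vandermonde matrix (the data are illustrative); np.polyfit solves the same least-squares problem internally.
```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
r = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)

k = 3
D = np.vander(x, k + 1)                       # Vandermonde matrix, k+1 columns
coeffs = np.linalg.solve(D.T @ D, D.T @ r)    # D^T D c = D^T r
print(coeffs)                                 # highest-degree coefficient first
print(np.polyfit(x, r, k))                    # same coefficients
```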

Coefficient of determination: R² = 1 − ∑t (r^t − g(x^t))² / ∑t (r^t − r̄)². The denominator is the sum of squared error when the predictors are ignored (i.e., when the label is always predicted by its mean r̄).
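A short sketch of this quantity (reusing the fit from the previous snippet; names are illustrative):
```python
import numpy as np

def r_squared(r, g):
    ss_res = np.sum((r - g) ** 2)          # error of the fitted model
    ss_tot = np.sum((r - r.mean()) ** 2)   # error when predictors are ignored
    return 1.0 - ss_res / ss_tot

# e.g. r_squared(r, np.polyval(coeffs, x)) for the polynomial fit above
```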

Tuning regression models Create M in silico datasets of size N. Use MLE to find M estimators g_i(x). Average the g_i(x) to get the best overall estimator. Calculate the bias and variance of the best estimator. f(x) is known, why?

Bias-Variance Dilemma As we increase model complexity (more free parameters θ determined by the minimization), bias decreases (a better fit to the given dataset) but variance increases (the fit varies more between datasets).

(Figure: f(x) = sin(x) plus noise, with one in silico experiment shown; linear regression over 5 experiments shows the bias; cubic regression over 5 experiments: each cubic has a shape like f(x), but the shapes of the estimators vary.)

Mean square error = bias² + variance. Bias decreases with the degree of the polynomial fit; variance increases with the degree. Best complexity: beyond degree 3, decreases in bias are offset by increases in variance.

Analysis of polynomial fit to real datasets This analysis is not possible because we cannot calculate the bias, why?

Divide real data into training and validation sets. Compare training and validation errors as a function of complexity. The shape of the validation-error curve (its "elbow") is the best indicator of the optimum complexity. What is "validation error"?

Regularization: balance bias and variance Penalize complex models that have small bias and large variance by adding a term to the error function: E′ = error on data + λ × (metric of model complexity). The value of λ is not part of the parameterization θ; the best value of λ is determined by experience.

Regularization in Polynomial Fitting By penalizing large weights we get better results for higher-order polynomials. w0 = 0 in all cases, why?
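As one concrete (hedged) version of this idea, a ridge-style penalty λ·||w||² added to the squared error gives a closed-form modification of the normal equations; the function name and λ value below are illustrative, and this simple version penalizes every coefficient, including w0.
```python
import numpy as np

def ridge_polyfit(x, r, degree, lam):
    """Minimize sum of squared residuals + lam * ||w||^2 for a polynomial of the given degree."""
    D = np.vander(x, degree + 1)
    I = np.eye(degree + 1)
    return np.linalg.solve(D.T @ D + lam * I, D.T @ r)   # (D^T D + lam I) w = D^T r

# e.g. w = ridge_polyfit(x, r, degree=9, lam=1e-3)  -> higher-order fit with damped weights
```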

Gaussian Parametric Classification Define a discriminant function using the posterior, ignoring the class-independent normalization. Assume the class likelihoods are Gaussian.

Bias and Variance of Estimators For datasets X_i drawn from a distribution with an unknown parameter θ, an estimate d_i (e.g. of the mean) is a random variable. Bias: b_θ(d) = E[d] − θ. Variance: E[(d − E[d])²]. Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance. For real datasets, the bias cannot be calculated because θ is unknown. In silico data adds noise to examples drawn from a distribution with a specified θ.

Derive p_i = ∑t x_i^t / N for a multinomial sample. Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏t p(x^t|θ). Log likelihood: L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ). Maximum likelihood estimator (MLE): θ* = argmaxθ L(θ|X). To apply MLE with constraints, we need Lagrange multipliers.

Bias and Variance (figure: mean square error decomposed into bias² and variance).

Estimating Bias and Variance M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M, are used to fit g_i(x), i = 1, ..., M. The average estimator is g̅(x) = (1/M) ∑i g_i(x); the estimated bias² is (1/N) ∑t (g̅(x^t) − f(x^t))² and the estimated variance is (1/(N M)) ∑t ∑i (g_i(x^t) − g̅(x^t))².
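A hedged sketch of this procedure (the target function, noise level, and polynomial degree are illustrative):
```python
import numpy as np

rng = np.random.default_rng(2)
f = np.sin                                    # known target function f(x)
x = np.linspace(0, 2 * np.pi, 25)             # evaluation points x^t
M, sigma, degree = 100, 0.3, 3

G = np.empty((M, x.size))
for i in range(M):
    r = f(x) + sigma * rng.standard_normal(x.size)      # in silico dataset i
    G[i] = np.polyval(np.polyfit(x, r, degree), x)       # fitted estimator g_i(x)

g_bar = G.mean(axis=0)                         # average estimator g-bar(x)
bias2 = np.mean((g_bar - f(x)) ** 2)           # (1/N) sum_t (g_bar(x^t) - f(x^t))^2
variance = np.mean((G - g_bar) ** 2)           # (1/(N M)) sum_t sum_i (g_i(x^t) - g_bar(x^t))^2
print(bias2, variance)
```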

Bayesian Model Selection Prior on models, p(model) Regularization, when prior favors simpler models Bayes, MAP of the posterior, p(model|data) Average over a number of models with high posterior (voting, ensembles: Chapter 17)

More about Bayes' Estimator Ignoring priors, the Maximum Likelihood (ML) estimate of θ is θ_ML = argmaxθ p(X|θ). Given posteriors we have other estimates. Maximum a Posteriori (MAP): θ_MAP = argmaxθ p(θ|X). Bayes': θ_Bayes' = E[θ|X] = ∫ θ p(θ|X) dθ. By integrating over posteriors we can express the unknown density in terms of the sample, p(x|X) = ∫ p(x|θ) p(θ|X) dθ, suggesting non-parametric approaches.

Parametric Classification

Error, Bias, and Variance: Let X = {x^t, r^t} be samples drawn from the joint distribution p(x, r). For each sample construct an estimator g(x) and the average ⟨g(x)⟩. The expected square error at x, E[(r − g(x))²|x], has contributions from noise and from error in the estimators; due to the noise, the expected square error cannot be zero. The expected square error in the estimators can be written as the sum of squared bias = (E[r|x] − ⟨g(x)⟩)² and variance = E_X[(g(x) − ⟨g(x)⟩)²].