Parametric Estimation


1 Parametric Estimation
Given a sample X = { x^t } where each x^t is drawn from a probability distribution p(x). Properties of p(x) can be discovered by parametric and non-parametric techniques. In parametric estimation, we assume a form p(x|θ) (e.g. Gaussian) and estimate the parameters θ (e.g. mean and variance). Maximum likelihood estimation: find the θ that maximizes the likelihood of drawing the sample X = { x^t }. Assume the x^t are independent and identically distributed (iid). Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press (V1.0)

2 Maximum Likelihood Estimation
Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ). Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ). Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X), the value of θ that maximizes L(θ|X).

3 Example: Bernoulli distribution
Two states, failure and success; x ∈ {0, 1}. P(x) = p_0^x (1 − p_0)^(1 − x), where p_0 is the probability of success. L(p_0|X) = log( ∏_t p_0^(x^t) (1 − p_0)^(1 − x^t) ). Show that the MLE is p_0 = ∑_t x^t / N = successes / trials.

4 Normalization is built into this distribution
If x = 1, P(x) = p_0; if x = 0, P(x) = 1 − p_0, so the probabilities of the two states already sum to 1. MLE can therefore be applied without constraints: solve dL/dp_0 = 0 for p_0. First step: simplify the log-likelihood function.

5 L(p_0|X) = log( ∏_t p_0^(x^t) (1 − p_0)^(1 − x^t) )
L(p_0|X) = log( ∏_t p_0^(x^t) (1 − p_0)^(1 − x^t) ) = ∑_t log( p_0^(x^t) (1 − p_0)^(1 − x^t) )

6 L(p_0|X) = ∑_t [ log(p_0^(x^t)) + log((1 − p_0)^(1 − x^t)) ]

7 L(p_0|X) = ∑_t [ x^t log(p_0) + (1 − x^t) log(1 − p_0) ]

8 L/p0 = St xt/po - (1 - xt )/(1 – po ) = 0

9 ∑_t x^t / p_0 = ∑_t (1 − x^t) / (1 − p_0)

10 ((1 − p_0) / p_0) ∑_t x^t = ∑_t (1 − x^t) = N − ∑_t x^t
∑_t x^t = p_0 N, so p_0 = ∑_t x^t / N
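As a quick numerical check of this result, here is a minimal sketch assuming NumPy and SciPy are available; the sample size, seed, and function names are illustrative, not from the slides. It compares the closed-form MLE ∑_t x^t / N with a direct numerical maximization of the log likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)          # Bernoulli sample x^t in {0, 1}

p_mle = x.mean()                             # closed-form MLE: fraction of successes

# Numerical check: maximize L(p_0|X) = sum_t [x^t log p_0 + (1 - x^t) log(1 - p_0)]
def neg_log_lik(p0):
    return -np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(p_mle, res.x)                          # the two estimates should agree closely
```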

11 Generalize to K>2 states: Multinomial Density
The outcome of a random event is one of K mutually exclusive and exhaustive states. p_i = probability that the outcome is state i. x is a Boolean indicator vector drawn from the distribution; the probability of drawing x is P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i). Only one term in the product differs from unity. Log likelihood of drawing the training set X: L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t). Show that the MLE is p_i = ∑_t x_i^t / N, the fraction of trials in which state i is the outcome.

12 In the multinomial case, MLE must be applied subject to a normalization constraint
Lagrange multipliers are a common approach to constrained optimization. Example Lagrangian: L(x, λ) = 1 − x_1² − x_2² + λ(x_1 + x_2 − 1). Set the partial derivatives of L with respect to x_1, x_2, and λ to zero: −2x_1 + λ = 0, −2x_2 + λ = 0, x_1 + x_2 − 1 = 0. It is not necessary to find λ (why?); λ is also called an "undetermined multiplier".

13 A simple example of constrained optimization using Lagrange multipliers
Find the stationary point of f(x_1, x_2) = 1 − x_1² − x_2² subject to the constraint g(x_1, x_2) = x_1 + x_2 − 1 = 0.

14 Form the Lagrangian L(x, λ) = 1 − x_1² − x_2² + λ(x_1 + x_2 − 1)

15 −2x_1 + λ = 0, −2x_2 + λ = 0, x_1 + x_2 − 1 = 0: solve for x_1 and x_2
Set the partial derivatives of L(x, λ) = 1 − x_1² − x_2² + λ(x_1 + x_2 − 1) with respect to x_1, x_2, and λ equal to zero: −2x_1 + λ = 0, −2x_2 + λ = 0, x_1 + x_2 − 1 = 0. Solve for x_1 and x_2.

16 x_1* = x_2* = ½; in this case it is not necessary to find λ
λ is sometimes called an "undetermined multiplier".
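The following minimal sketch, assuming SciPy is available, cross-checks this example numerically by minimizing −f under the same equality constraint; the starting point is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x1, x2) = 1 - x1^2 - x2^2 subject to x1 + x2 - 1 = 0
neg_f = lambda x: -(1.0 - x[0]**2 - x[1]**2)               # minimize -f
constraint = {"type": "eq", "fun": lambda x: x[0] + x[1] - 1.0}

res = minimize(neg_f, x0=np.array([0.0, 0.0]), constraints=[constraint])
print(res.x)                                               # expect approximately [0.5, 0.5]
```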

17 Generalize to K>2 states: Multinomial Density
The outcome of a random event is one of K mutually exclusive and exhaustive states. p_i = probability that the outcome is state i. Log likelihood of drawing a training set X of size N: L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t). Show that the MLE is p_i = ∑_t x_i^t / N. First step: simplify the log-likelihood function.

18 MLE of multinomial distribution by Lagrange multipliers
Probability of drawing x: P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i). L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t) = ∑_t ∑_i x_i^t log(p_i).

19 MLE by Lagrange multipliers continued
x_k^t is Boolean, and ∑_i x_i^t = 1 for each example. Maximize ∑_t ∑_i x_i^t log(p_i) subject to ∑_i p_i = 1 with the Lagrangian ∑_t ∑_i x_i^t log(p_i) + λ(∑_i p_i − 1). Setting the derivative with respect to p_i to zero gives ∑_t x_i^t / p_i + λ = 0, so p_i = −∑_t x_i^t / λ. Summing over i gives λ = −N, hence p_i = ∑_t x_i^t / N.
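As a numerical sanity check of this result, here is a minimal sketch assuming NumPy; the state probabilities, sample size, and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.2, 0.5, 0.3])            # K = 3 state probabilities
N = 10_000

# Draw N one-hot indicator vectors x^t (each row has exactly one 1)
states = rng.choice(len(p_true), size=N, p=p_true)
X = np.eye(len(p_true))[states]

p_mle = X.sum(axis=0) / N                     # p_i = sum_t x_i^t / N
print(p_mle)                                  # should be close to p_true
```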

20 1D Gaussian Distribution
p(x) = N(μ, σ²): a function of a single random variable with a shape characterized by two parameters, μ and σ. The MLE estimators are m = (1/N) ∑_t x^t for μ and s² = (1/N) ∑_t (x^t − m)² for σ².

21 z is normally distributed with zero mean and unit variance
Pseudo-code for sampling a Gaussian distribution with a specified mean and variance: z is normally distributed with zero mean and unit variance. Find a library function that returns random numbers drawn from p(z). Given a random number z_i from this distribution, x_i = σ z_i + μ is a random number with the desired mean and variance.
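A minimal NumPy sketch of this recipe; the particular generator call, mean, and variance are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

z = rng.standard_normal(10_000)   # z ~ N(0, 1): zero mean, unit variance
x = sigma * z + mu                # x_i = sigma * z_i + mu, so x ~ N(mu, sigma^2)

print(x.mean(), x.var())          # should be close to mu and sigma^2
```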

22 Gaussian Parametric Classification
Define a discriminant function using Bayes' rule, P(C_i|x) = p(x|C_i) P(C_i) / p(x) (posterior = likelihood × prior / evidence), with class likelihoods p(x|C_i) that are Gaussian distributed. First step: take the log of P(C_i|x).

23 We can drop the term log(p(x))
Why? Next step?

24 Substitute the log of the Gaussian class likelihood: g_i(x) = −(1/2) log(2πσ_i²) − (x − μ_i)²/(2σ_i²) + log P(C_i)

25 Given a 1D multi-class dataset
and a discriminant function g_i(x). How do we use this discriminant to classify an object with attribute x?

26 Given the value of attribute x, calculate g_i(x) for all classes
Assign the object to the class with the largest g_i(x). Before this procedure can be followed, we must have estimators for the mean, variance, and prior of each class.

27 Estimate prior, mean, and variance of all classes
x^t is a scalar and r^t is a Boolean vector. Use r_i^t to pick out the class-i examples in sums over the whole dataset. The MLE of the prior P(C_i) is the fraction of examples in class i; m_i and s_i² are the class-specific estimators of the mean and variance.

28 Use MLE results to construct class discriminants
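A minimal sketch of slides 26 to 28, assuming NumPy; the generated data, function names, and test point are illustrative. It computes the class-wise MLE estimators and the Gaussian discriminant g_i(x), then assigns the class with the largest value.

```python
import numpy as np

def fit_gaussian_classes(x, labels, K):
    """MLE of prior, mean, and variance for each of K classes (1D attribute x)."""
    priors, means, variances = [], [], []
    for i in range(K):
        xi = x[labels == i]                  # class-i examples picked out by the labels
        priors.append(len(xi) / len(x))      # P(C_i): fraction of examples in class i
        means.append(xi.mean())              # m_i
        variances.append(xi.var())           # s_i^2 (MLE, divides by the class count)
    return np.array(priors), np.array(means), np.array(variances)

def discriminant(x, priors, means, variances):
    """g_i(x) = -0.5*log(2*pi*s_i^2) - (x - m_i)^2 / (2*s_i^2) + log P(C_i)."""
    return (-0.5 * np.log(2 * np.pi * variances)
            - (x - means) ** 2 / (2 * variances)
            + np.log(priors))

# Usage: assign an object with attribute x to the class with the largest g_i(x)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
labels = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
priors, means, variances = fit_gaussian_classes(x, labels, K=2)
print(np.argmax(discriminant(0.5, priors, means, variances)))   # expect class 1
```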

29 At the boundary the most probable class changes: a single boundary halfway between the means
Example: a 1D two-class problem in which the class likelihoods have means of +2 and −2, equal variances, and equal priors. Between −2 and +2 the posteriors transition between essentially certain classification of one class and of the other as a function of x. There is a single boundary halfway between the means, where the normalized posteriors are both equal to 0.5; at this boundary the most probable class changes.

30 The red class likelihood is also dominant for x below about −7
Here the variances of the two classes are different. Boundaries where the dominant posterior changes are called "Bayes discriminant points".

31 Assignment 2, due 9/20/12: Use the equality of discriminants to derive a quadratic equation for Bayes' discriminant points in a 1D, 2-class problem with Gaussian class likelihoods. The mean and variance of C1 are 3 and 1, respectively; the mean and variance of C2 are 2 and 0.3, respectively; the priors are equal. With a sample size of 100, compare the MLE estimators to the true means and variances. For the same sample, compare the Bayes' discriminant points calculated from the MLE estimators with those derived from the true means and variances.

32 For a 1D, 2-class problem with Gaussian class likelihoods,
derive the functional form of P(C1|x) when the following are true: (1) variances and priors are equal, and (2) the posteriors are normalized. Hint: start with the ratio of posteriors, P(C_i|x) = p(x|C_i) P(C_i) / p(x), to eliminate the priors and the evidence.

33 With equal priors, P(C1|x)/P(C2|x) = p(x|C1)/p(x|C2) = f(x). How do we derive f(x)?

34 f(x) = p(x|C1)/p(x|C2) = N(μ1, σ1²)/N(μ2, σ2²)
With σ1 = σ2 = σ: f(x) = exp(−(x − μ1)²/(2σ²)) / exp(−(x − μ2)²/(2σ²)). How do we simplify this expression?

35 Combine exponents and simplify
Combining the exponents gives f(x) = exp( (μ1 − μ2)x/σ² + (μ2² − μ1²)/(2σ²) ). Why did the quadratic term cancel? P(C1|x)/P(C2|x) = p(x|C1)/p(x|C2) = f(x). Given f(x), how do we get P(C1|x)?

36 Use normalization: P(C2|x) = 1 − P(C1|x)
P(C1|x)/(1 − P(C1|x)) = f(x); solve for P(C1|x). Writing f(x) = exp(y) with y = wx + w_0 gives P(C1|x) = sigmoid(y) = 1/(1 + exp(−y)). The domain of class 1 is where P(C1|x) > 0.5.
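A minimal sketch of this result, assuming NumPy; from the f(x) on slide 35, w = (μ1 − μ2)/σ² and w_0 = (μ2² − μ1²)/(2σ²), and the specific means, variance, and function name below are illustrative.

```python
import numpy as np

def posterior_c1(x, mu1, mu2, sigma):
    """P(C1|x) for equal priors and equal variances: a sigmoid of a linear function of x."""
    w = (mu1 - mu2) / sigma**2
    w0 = (mu2**2 - mu1**2) / (2 * sigma**2)
    y = w * x + w0
    return 1.0 / (1.0 + np.exp(-y))           # sigmoid(y) = f(x) / (1 + f(x))

x = np.linspace(-5, 5, 11)
print(posterior_c1(x, mu1=2.0, mu2=-2.0, sigma=1.0))
# P(C1|x) crosses 0.5 at x = 0, halfway between the means
```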

37 Most common model of output from a neuron with weighted connections to input is
y = wx + w_0 = w^T x, with a bias node supplying the constant input 1, and s = sigmoid(y). Training this perceptron dichotomizer optimizes the weights to switch s from values near 1 to values near 0 as the attributes x change from those of class members to those of non-class members.

38 Contrast between parametric and non-parametric methods
Parametric: use the discriminant function g_i(x) to assign the class, evaluated with the estimators of the mean and variance obtained from MLE. All of this is based on the assumption of Gaussian class likelihoods.

39 Contrast between parametric and non-parametric methods (continued)
Non-parametric: use the same discriminant function, y = wx + w_0 = w^T x with s = sigmoid(y), but with the parameters determined from the data. Some optimization procedure must replace MLE; for ANNs we most often use back-propagation.
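As an illustration of "some optimization procedure must replace MLE", here is a minimal gradient-descent sketch for the single-neuron dichotomizer above, assuming NumPy; it is plain logistic-regression-style training on synthetic 1D data, not the full back-propagation of a multi-layer ANN, and the data, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1D two-class data: labels r^t are 1 for class members, 0 otherwise
x = np.concatenate([rng.normal(2, 1, 200), rng.normal(-2, 1, 200)])
r = np.concatenate([np.ones(200), np.zeros(200)])

w, w0, eta = 0.0, 0.0, 0.1                    # weights and learning rate
for epoch in range(200):
    s = 1.0 / (1.0 + np.exp(-(w * x + w0)))   # sigmoid output of the dichotomizer
    w -= eta * np.mean((s - r) * x)           # gradient of the cross-entropy error w.r.t. w
    w0 -= eta * np.mean(s - r)                # gradient w.r.t. the bias weight w0

print(w, w0)   # s is near 1 for class members (x around +2) and near 0 otherwise
```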

40 Regression
The only restriction on the data for regression is that the label (also called the output) be a real number. The input vector can have components that are mixtures of real numbers and Booleans. The input should contain all of the attributes (also called predictors) that influence the value of the label. To develop the theory of parametric regression, we treat the case of scalar input (i.e. one real number).

41 Regression: scalar input
The output (label) is assumed to be an unknown function of the input plus noise that is normally distributed with zero mean and variance σ²: r = f(x) + ε. Since the distribution of ε has zero mean, p(r|x) = N(f(x), σ²). We seek an estimator g(x|θ) of the mean of N(f(x), σ²), with parameters θ to be determined by MLE.

42 Maximum likelihood estimation of g(x|θ)
Log likelihood: L(θ|X) = log ∏_t p(r^t|x^t), where p(r^t|x^t) = N(g(x^t|θ), σ²). How do we simplify this log-likelihood function?

43 Product over t becomes sum over t
L(θ|X) = ∑_t log p(r^t|x^t). Next?

44 Two terms in the sum: how do we evaluate this sum?
Substituting the Gaussian, log p(r^t|x^t) = −(1/2) log(2πσ²) − (r^t − g(x^t|θ))²/(2σ²), so the sum splits into a constant term and a term proportional to the sum of squared residuals.

45 How do we choose θ to maximize this log-likelihood function?

46 Maximize likelihood by minimizing the sum of squared residuals
Maximizing the likelihood is equivalent to minimizing the sum of squared residuals, E(θ|X) = ∑_t [r^t − g(x^t|θ)]². This optimization criterion is a consequence of the assumption of Gaussian-distributed noise.

47 Regression by linear least squares
Assume g(x|θ) is a linear combination of n known functions (1, x, x², ..., x^(n−1), for example). Let m be the number of examples (x^t, r^t) in the training set. Define the m×n matrix A with A_tj = the jth function evaluated at x^t, θ the column vector of n unknown coefficients, and b the column vector of the m labels r^t in the training set. If Aθ = b has a solution, then g(x^t|θ) = r^t for every example; this is not what we want. Why? With n << m, Aθ = b has no exact solution.

48 Normal equations: look for an approximate solution that minimizes a norm of the residual vector
The residual vector is r = b − Aθ. Choosing the Euclidean norm, we minimize the sum of squared residuals: define f(θ) = ||r||² = r^T r, so f(θ) = (b − Aθ)^T (b − Aθ) = b^T b − 2θ^T A^T b + θ^T A^T A θ. A necessary condition for θ_0 to be a minimum of f(θ) is ∇f(θ_0) = 0. Since ∇f(θ) = 2A^T A θ − 2A^T b, the optimal set of parameters is a solution of the n×n symmetric system of linear equations A^T A θ = A^T b.

49 Polynomial Regression
Vandermonde matrix D for a polynomial of degree k fit to N data points. Solve D^T D x = D^T r for the k+1 coefficients.
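A minimal sketch of this polynomial fit via the normal equations, assuming NumPy; the target function, noise level, and degree are illustrative, and the coefficient vector is called theta here to avoid reusing x.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 50, 3                                  # N data points, polynomial degree k
x = np.linspace(0, 2 * np.pi, N)
r = np.sin(x) + rng.normal(0, 0.2, N)         # noisy labels r^t

# Vandermonde matrix: columns are 1, x, x^2, ..., x^k evaluated at each x^t
D = np.vander(x, k + 1, increasing=True)

# Normal equations D^T D theta = D^T r give the k+1 least-squares coefficients
theta = np.linalg.solve(D.T @ D, D.T @ r)
g = D @ theta                                 # fitted values at the training inputs
print(theta)
```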

50 Coefficient of determination
R² = 1 − ∑_t (r^t − g(x^t|θ))² / ∑_t (r^t − r̄)², where r̄ is the mean label. The denominator is the sum of squared errors when the predictors are ignored and the label is predicted by its mean.
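A minimal sketch of this quantity, assuming NumPy; the function name is illustrative, and the commented usage reuses r and g from the polynomial-regression sketch above.

```python
import numpy as np

def r_squared(r, g):
    """Coefficient of determination: 1 - SSE of the model / SSE when predictors are ignored."""
    sse_model = np.sum((r - g) ** 2)          # residuals of the fitted model
    sse_mean = np.sum((r - r.mean()) ** 2)    # error of predicting the label by its mean
    return 1.0 - sse_model / sse_mean

# Example (using r and g from the polynomial-regression sketch above):
# print(r_squared(r, g))
```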

51 Tuning regression models
Create M in silico datasets of size N. Use MLE to find M estimators g_i(x). Average the g_i(x) to get the best overall estimator. Calculate the bias and variance of the best estimator; for in silico data f(x) is known. Why does that matter?

52 Bias-Variance Dilemma
As we increase model complexity (more free parameters θ determined by minimizing the error), the bias decreases (a better fit to the given dataset) but the variance increases (the fit varies more from dataset to dataset).

53 In silico example: datasets generated as sin(x) + noise
Figure: the true function f, one in silico dataset, the individual estimators g_i, and their average g; the gap between g and f is the bias. Linear regression is shown over 5 experiments, alongside cubic fits: each cubic has a shape like f(x), but the shapes of the individual estimators vary from dataset to dataset.

54 Mean square error = bias² + variance
Bias decreases with the degree of the polynomial fit, while variance increases with the degree. There is a best complexity: beyond degree 3, decreases in bias are offset by increases in variance.

55 Analysis of polynomial fit to real datasets
This analysis is not possible for real data because we cannot calculate the bias. Why?

56 Divide real data into training and validation sets
Compare the training and validation errors as a function of complexity. The shape of the validation error is the best indicator of the optimum complexity, which appears as an "elbow" in the curve. What is "validation error"?

57 Regularization: balance bias and variance
Penalize complex models, which have small bias and large variance, by adding a term to the error function: E' = error on data + λ (metric of model complexity). The value of λ is not part of the parameterization θ; the best value of λ is determined by experience.

58 Regularization in Polynomial Fitting
By penalizing large weights we get better results for higher-order polynomials. w_0 = 0 in all cases. Why?
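A minimal weight-decay sketch, assuming NumPy; here the complexity metric is taken to be the squared norm of the polynomial coefficients, E' = ∑_t (r^t − g(x^t|θ))² + λ||θ||², which leads to the modified normal equations (D^T D + λI) θ = D^T r. The data, degree, and λ are arbitrary choices, and penalizing every coefficient (including the constant term) is a simplification of the slide's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 30)
r = np.sin(x) + rng.normal(0, 0.2, len(x))

k, lam = 7, 1e-2                                      # high-order polynomial, penalty weight
D = np.vander(x, k + 1, increasing=True)

# Regularized (ridge) solution: (D^T D + lam*I) theta = D^T r
theta_ridge = np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)
theta_plain = np.linalg.solve(D.T @ D, D.T @ r)       # unregularized fit, for comparison

print(np.abs(theta_plain).max(), np.abs(theta_ridge).max())   # the penalty shrinks large weights
```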


60 Gaussian Parametric Classification
Define a discriminant function using the posterior, ignoring the class-independent normalization (the evidence). Assume the class likelihoods are Gaussian.

61 Bias and Variance of Estimators
For datasets X_i drawn from a distribution with unknown parameter θ, an estimate d_i of the mean is a random variable. Bias: b_θ(d) = E[d] − θ. Variance: E[(d − E[d])²]. Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance. For real datasets the bias cannot be calculated because θ is unknown; in silico data adds noise to examples drawn from a distribution with a specified θ.

62 Derive p_i = ∑_t x_i^t / N for a multinomial sample
Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ). Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ). Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X). To apply MLE with constraints, we need Lagrange multipliers.

63 Bias and Variance
Figure: illustration of the bias and variance of an estimator.

64 Estimating Bias and Variance
M samples X_i = {x_i^t, r_i^t}, i = 1, ..., M, are used to fit g_i(x), i = 1, ..., M. With the average estimator ḡ(x) = (1/M) ∑_i g_i(x), the estimated bias² is (1/N) ∑_t [ḡ(x^t) − f(x^t)]² and the estimated variance is (1/(N M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]².
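A minimal sketch of this procedure on in silico data, assuming NumPy; the true function, number of datasets M, noise level, and polynomial degree are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                     # true function, known for in silico data
x = np.linspace(0, 2 * np.pi, 40)              # fixed inputs x^t
M, degree, noise = 25, 3, 0.3

# Fit one polynomial estimator g_i per in silico dataset
fits = []
for _ in range(M):
    r = f(x) + rng.normal(0, noise, len(x))    # r^t = f(x^t) + eps
    coeffs = np.polyfit(x, r, degree)
    fits.append(np.polyval(coeffs, x))
G = np.array(fits)                             # G[i, t] = g_i(x^t)

g_bar = G.mean(axis=0)                         # average estimator at each x^t
bias2 = np.mean((g_bar - f(x)) ** 2)           # estimated squared bias
variance = np.mean((G - g_bar) ** 2)           # estimated variance
print(bias2, variance)
```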

65 Bayesian Model Selection
Prior on models, p(model). Regularization corresponds to a prior that favors simpler models. Bayes: take the MAP of the posterior p(model|data), or average over a number of models with high posterior (voting, ensembles: Chapter 17).

66 More about Bayes’ Estimator
Ignoring priors, the maximum likelihood (ML) estimate of θ is θ_ML = argmax_θ p(X|θ). Given posteriors we have other estimates. Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X). Bayes': θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ. By integrating over posteriors we can express the unknown density in terms of the sample, p(x|X) = ∫ p(x|θ) p(θ|X) dθ, suggesting non-parametric approaches.
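To make the three estimates concrete, here is a minimal sketch, assuming NumPy, for a Bernoulli parameter with a Beta prior; the conjugate Beta-Bernoulli pair is not on the slide and is used here only because its posterior is available in closed form, and the prior parameters and sample are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=20)              # small Bernoulli sample
a, b = 2.0, 2.0                                # Beta(a, b) prior on the success probability
s, n = x.sum(), len(x)

theta_ml = s / n                               # argmax_theta p(X|theta): fraction of successes
# With a Beta(a, b) prior, the posterior is Beta(a + s, b + n - s)
theta_map = (a + s - 1) / (a + b + n - 2)      # posterior mode (MAP)
theta_bayes = (a + s) / (a + b + n)            # posterior mean, E[theta|X]

print(theta_ml, theta_map, theta_bayes)
```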

67 Parametric Classification

68 Error, Bias, and Variance:
Let X = {x^t, r^t} be a sample drawn from the joint distribution p(x, r). For each sample construct an estimator g(x), and let <g(x)> denote its average over samples. The expected square error at x, E[(r − g(x))²|x], has contributions from the noise and from the error in the estimators; because of the noise, the expected square error cannot be zero. The expected square error in the estimators can be written as the sum of the squared bias, (E[r|x] − <g(x)>)², and the variance, E_X[(g(x) − <g(x)>)²].

