
1 Over-fitting and Regularization. Chapter 4 of the textbook; Lectures 11 and 12 on amlbook.com

2 Over-fitting is easy to recognize in 1D: parabolic target function, 4th-order hypothesis, 5 data points -> E_in = 0
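A minimal sketch of this setup, assuming a noisy quadratic target and made-up data (not the slide's actual numbers):

```python
# Fit a 4th-order polynomial to 5 points drawn from a parabolic target plus noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 5)                  # 5 data points
y = x**2 + 0.1 * rng.standard_normal(5)    # parabolic target with added noise

w = np.polyfit(x, y, deg=4)                # 4th-order hypothesis: 5 coefficients for 5 points
E_in = np.mean((np.polyval(w, x) - y)**2)  # in-sample error is (numerically) zero
print(E_in)
```

With as many free parameters as data points, the fit interpolates the noise exactly, which is the over-fitting the slide illustrates.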

3 The origin of over-fitting can be analyzed in 1D: the bias/variance dilemma. How does this apply to the case on the previous slide?

4 The shape of the fit is very sensitive to noise in the data, so the out-of-sample error will vary greatly from one dataset to another.

5 Over-fitting is easy to avoid in 1D: results from HW1. (Figure: sum of squared deviations versus degree of polynomial, with curves for E_in and E_val.)
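A rough sketch of how such curves can be produced, assuming a held-out validation set and a synthetic noisy parabolic target (not the actual HW1 data):

```python
# E_in and E_val (sums of squared deviations) as the polynomial degree grows.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
y = x**2 + 0.2 * rng.standard_normal(40)   # assumed noisy parabolic target

x_tr, y_tr = x[:25], y[:25]                # training set
x_va, y_va = x[25:], y[25:]                # validation set

for deg in range(1, 9):
    w = np.polyfit(x_tr, y_tr, deg)
    E_in = np.sum((np.polyval(w, x_tr) - y_tr)**2)
    E_val = np.sum((np.polyval(w, x_va) - y_va)**2)
    print(deg, round(E_in, 3), round(E_val, 3))
```

E_in keeps falling as the degree grows, while E_val flattens and then rises, which is what makes the over-fitting visible.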

6 Using E_val to avoid over-fitting works in all dimensions, but the computation grows rapidly for large d. Digit recognition, "one vs. not one": d = 2 (intensity and symmetry), terms in Φ_5(x) added successively, 500 points in the training set. The validation set needs to be large: 8798 points in this case. (Figure: in-sample, cross-validation, and validation errors versus the number of terms added.)

7 What if we want to add higher-order terms to a linear model but don't have enough data for a validation set? Solution: augment the error function used to optimize the weights. Example: E_aug(w) = E_in(w) + (λ/N) w^T w, which penalizes choices with large |w|; this is called "weight decay".

8 The normal equations with weight decay are essentially unchanged: (Z^T Z + λI) w_reg = Z^T y
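A minimal numpy sketch of solving these normal equations, with made-up data, a 4th-order polynomial transform Z, and an assumed λ:

```python
# Solve (Z^T Z + lambda*I) w_reg = Z^T y for the weight-decay solution.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 20)
y = x**2 + 0.1 * rng.standard_normal(20)

Z = np.vander(x, N=5, increasing=True)     # columns 1, x, x^2, x^3, x^4
lam = 1e-4                                 # the lambda value quoted on the next slide

w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
print(w_reg)
```

Setting lam = 0 recovers the ordinary least-squares solution, which is why the weight-decay version is "essentially unchanged".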

9 The best value of λ is subjective. In this case λ = 0.0001 is large enough to suppress the swings, but the data are still important in determining the optimum weights.

10

11 Review for Quiz 2. Topics: linear models, extending linear models by transformation, dimensionality reduction, over-fitting and regularization. Two classes are distinguished by a threshold on the value of a linear combination of d attributes. Explain how h(w|x) = sign(w^T x) becomes a hypothesis set for linear binary classification.
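A tiny sketch of that hypothesis, assuming the usual convention that a constant coordinate x_0 = 1 absorbs the threshold (the numbers are made up):

```python
# h(x) = sign(w^T x): +1 on one side of the hyperplane w^T x = 0, -1 on the other.
import numpy as np

def h(w, x):
    return np.sign(w @ x)

w = np.array([-0.5, 1.0, 2.0])   # w_0 plays the role of the (negated) threshold
x = np.array([1.0, 0.3, 0.4])    # x_0 = 1, followed by d = 2 attributes
print(h(w, x))                   # prints 1.0
```

Each choice of w gives one classifier; the set of all such classifiers is the hypothesis set.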

12 More Review for Quiz 2. Topics: linear models, extending linear models by transformation, dimensionality reduction, over-fitting and regularization. We have used 1-step optimization in 4 ways: polynomial regression in 1D (curve fitting), multivariate linear regression, extending linear models by transformation, regularization by weight decay. Two of these are equivalent; which ones?

13 More Review for Quiz 2. Topics: linear models, extending linear models by transformation, dimensionality reduction, over-fitting and regularization. 1-step optimization requires the in-sample error to be a sum of squared residuals. Define the in-sample error for the following: multivariate linear regression, extending linear models by transformation, regularization by weight decay.
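For reference, one way to write the three in-sample errors in matrix notation (the 1/N factor and the placement of λ follow the weight-decay slides above and may differ from the textbook by a constant):

```latex
E_{\mathrm{in}}(w) = \tfrac{1}{N}\,(Xw - y)^{\mathsf T}(Xw - y)
  \quad\text{(multivariate linear regression)}
E_{\mathrm{in}}(w) = \tfrac{1}{N}\,(Zw - y)^{\mathsf T}(Zw - y)
  \quad\text{(extended linear model, } Z = \Phi(X)\text{)}
E_{\mathrm{aug}}(w) = \tfrac{1}{N}\big[(Zw - y)^{\mathsf T}(Zw - y) + \lambda\, w^{\mathsf T} w\big]
  \quad\text{(weight decay)}
```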

14 For multivariate linear regression the normal equations are X^T X w = X^T y. Derive the normal equations for extended linear regression with weight decay.
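A sketch of the derivation, setting the gradient of the augmented error to zero:

```latex
E_{\mathrm{aug}}(w) = \tfrac{1}{N}\big[(Zw - y)^{\mathsf T}(Zw - y) + \lambda\, w^{\mathsf T} w\big]
\nabla_w E_{\mathrm{aug}}(w) = \tfrac{2}{N}\big[Z^{\mathsf T}(Zw - y) + \lambda w\big] = 0
\;\Longrightarrow\; (Z^{\mathsf T} Z + \lambda I)\, w_{\mathrm{reg}} = Z^{\mathsf T} y .
```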

15 Interpret the "learning curve" for multivariate linear regression when the training data has normally distributed noise. Why does E_out approach σ² from above? Why does E_in approach σ² from below? Why is E_in not defined for N < d+1?
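One way to answer these questions uses the expected errors for linear regression with Gaussian noise of variance σ², a textbook result quoted here from memory and valid for N ≥ d+1:

```latex
\mathbb{E}\,[E_{\mathrm{in}}]  = \sigma^2\Big(1 - \frac{d+1}{N}\Big), \qquad
\mathbb{E}\,[E_{\mathrm{out}}] = \sigma^2\Big(1 + \frac{d+1}{N}\Big), \qquad N \ge d+1 .
```

For N < d+1 there are fewer equations than free parameters, the training data can be fit exactly, and the expression for E_in above no longer applies.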

16 What do these learning curves say about simple vs. complex models? (Figure annotation: the error is still larger than the bound set by the noise.)

17 How do we estimate a good level of complexity without sacrificing training data?

18 Why choose 3 rather than 4?

19 Review: Maximum Likelihood Estimation. Estimate the parameters θ of a probability distribution given a sample X drawn from that distribution. (Slide from Lecture Notes for E. Alpaydın, Introduction to Machine Learning, 2nd ed., © 2010 The MIT Press.)

20 Form the likelihood function. Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ). Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ). Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X), the value of θ that maximizes L(θ|X).
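A small illustration, assuming a 1D Gaussian model and synthetic data, of the MLE recipe above:

```python
# For a Gaussian, the MLE of the mean is the sample mean and the MLE of the
# variance is the biased sample variance (dividing by N, not N-1).
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # sample drawn from N(2, 1.5^2)

mu_mle = X.mean()                    # argmax over mu of L(mu, sigma^2 | X)
var_mle = ((X - mu_mle)**2).mean()   # argmax over sigma^2
print(mu_mle, var_mle)               # close to 2.0 and 2.25
```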

21 How was MLE used in logistic regression to derive an expression for the in-sample error?

22 In logistic regression the parameters are the weights. Likelihood of w given the sample X: l(w|X) = p(X|w) = ∏_n p(x_n|w). Log likelihood: L(w|X) = log l(w|X) = ∑_n log p(x_n|w). In logistic regression, p(x_n|w) = θ(y_n w^T x_n), where θ(s) = 1/(1 + e^-s).

23 Since log is a monotone increasing function, maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood. The text also normalizes by dividing by N; hence the error function becomes E_in(w) = (1/N) ∑_n ln(1 + exp(-y_n w^T x_n)). How?
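A sketch of the step, using θ(s) = 1/(1 + e^-s) so that 1/θ(s) = 1 + e^-s:

```latex
E_{\mathrm{in}}(w) = -\tfrac{1}{N}\,\mathcal{L}(w \mid \mathcal{X})
 = \tfrac{1}{N}\sum_{n=1}^{N} \ln\frac{1}{\theta\!\left(y_n w^{\mathsf T} x_n\right)}
 = \tfrac{1}{N}\sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^{\mathsf T} x_n}\right).
```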

24 Derive the log-likelihood function for a 1D Gaussian distribution
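For reference, a sketch of the answer for a sample X = {x^t}, t = 1, ..., N:

```latex
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\mathcal{L}(\mu, \sigma^2 \mid \mathcal{X}) = \sum_{t=1}^{N} \log p(x^t \mid \mu, \sigma^2)
 = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{t=1}^{N}(x^t - \mu)^2 .
```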

25 Stochastic gradient descent: correct the weights using the error on each data point.
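A minimal sketch of the single-point update, with made-up data, squared-error loss, and an assumed learning rate:

```python
# Stochastic gradient descent for a linear model: one weight update per data point.
import numpy as np

rng = np.random.default_rng(3)
X = np.c_[np.ones(50), rng.uniform(-1, 1, 50)]        # x_0 = 1 plus one attribute
y = 0.5 + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(50)

w = np.zeros(2)
eta = 0.1                                             # assumed learning rate
for epoch in range(100):
    for n in rng.permutation(len(y)):
        err = w @ X[n] - y[n]                         # error on this single point
        w -= eta * err * X[n]                         # gradient step for squared error
print(w)                                              # approaches (0.5, 2.0)
```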

26 PCA: I want to perform PCA on a dataset. What must I assume about the noise in the data?

27 More PCA: The correlation coefficients of normally distributed attributes x are zero. What can we say about the covariance of x?

28 More PCA: Attributes x are normally distributed with mean μ and covariance Σ. z = Mx is a linear transformation to the feature space defined by matrix M. What are the mean and covariance of these features?
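A sketch of the answer, using linearity of expectation:

```latex
\mathbf{E}[z] = \mathbf{E}[Mx] = M\mu, \qquad
\operatorname{Cov}(z) = \mathbf{E}\!\left[(Mx - M\mu)(Mx - M\mu)^{\mathsf T}\right] = M\,\Sigma\,M^{\mathsf T}.
```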

29 More PCA: z_k is the feature defined by the projection of the attributes in the direction of the eigenvector w_k of the covariance matrix. Prove that the eigenvalue λ_k is the variance of z_k.
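A sketch of the proof, assuming the eigenvectors are normalized so that w_k^T w_k = 1:

```latex
\operatorname{Var}(z_k) = \operatorname{Var}\!\left(w_k^{\mathsf T} x\right)
 = w_k^{\mathsf T}\,\Sigma\, w_k
 = w_k^{\mathsf T}\left(\lambda_k w_k\right)
 = \lambda_k\, w_k^{\mathsf T} w_k = \lambda_k .
```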

30 Constrained optimization: How do we find the values of x_1 and x_2 that minimize f(x_1, x_2) subject to the constraint g(x_1, x_2) = c? Find the stationary points of f(x_1, x_2) = 1 - x_1^2 - x_2^2 subject to the constraint g(x_1, x_2) = x_1 + x_2 = 1.
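A worked sketch using a Lagrange multiplier α:

```latex
\mathcal{L}(x_1, x_2, \alpha) = 1 - x_1^2 - x_2^2 - \alpha\,(x_1 + x_2 - 1)
\frac{\partial \mathcal{L}}{\partial x_1} = -2x_1 - \alpha = 0, \qquad
\frac{\partial \mathcal{L}}{\partial x_2} = -2x_2 - \alpha = 0, \qquad
x_1 + x_2 = 1
\;\Longrightarrow\; x_1 = x_2 = \tfrac{1}{2}, \quad \alpha = -1, \quad f\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{2} .
```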

