
1 Statistical Learning Dong Liu Dept. EEIS, USTC

2 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

3 Chap 1. Linear Regression
A motivating example 1/2
What is the height of Mount Qomolangma? A piece of knowledge about one variable. How do we achieve this "knowledge" from data? We have a series of measurements; for example, we can use the (arithmetic) mean:

def HeightOfQomolangma():
    return 8848.86  # hard-coded "knowledge"; the constant here is an assumed value in meters

def SLHeightOfQomolangma(data):
    return sum(data) / len(data)  # "statistical learning": the arithmetic mean of the measurements

4 Chap 1. Linear Regression
A motivating example 2/2
Or in this way, separating learning from using:

hQomo = 0

# Learning
def LearnHeightOfQomolangma(data):
    global hQomo
    hQomo = sum(data) / len(data)

# Using
def UseHeightOfQomolangma():
    return hQomo

5 Chap 1. Linear Regression
Why arithmetic mean? Least squares: the mean is the value that minimizes the sum of squared deviations, $\hat h = \arg\min_h \sum_n (x_n - h)^2$. In statistical learning, we often formulate such optimization problems and try to solve them. How to formulate? How to solve?
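A minimal numerical check (the data values and candidate grid are hypothetical): the minimizer of the sum of squared deviations coincides with the arithmetic mean.

data = [8848.1, 8849.0, 8847.5, 8848.9, 8848.3]  # hypothetical measurements

def sse(h, data):
    # sum of squared deviations from a candidate height h
    return sum((x - h) ** 2 for x in data)

candidates = [8846 + 0.01 * i for i in range(400)]          # brute-force grid of candidate heights
best = min(candidates, key=lambda h: sse(h, data))
print(best, sum(data) / len(data))                          # both are (close to) the arithmetic mean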

6 From the statistical perspective
The height of Qomolangma is a random variable, which follows a specific probability distribution, for example a Gaussian (normal) distribution. The measurements are observations of the random variable and are used to estimate the distribution. Assumption: the observations are independent and identically distributed (i.i.d.).

7 Maximum likelihood estimation
Likelihood function: the probability (density) of an observation, viewed as a function of the parameter. Overall likelihood function (recall i.i.d.): the product of the per-observation likelihoods. We need to find the parameter that maximizes the overall likelihood, and it reduces to least squares (a sketch of the reduction is given below).
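A sketch of the elided derivation, assuming i.i.d. Gaussian measurements $x_n \sim \mathcal{N}(h, \sigma^2)$:

\[
p(x_1,\dots,x_N \mid h, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Big(-\frac{(x_n-h)^2}{2\sigma^2}\Big),
\qquad
\hat h_{\mathrm{ML}} = \arg\max_h\, p(\mathbf{x}\mid h,\sigma^2)
= \arg\min_h \sum_{n=1}^{N}(x_n-h)^2 = \frac{1}{N}\sum_{n=1}^{N} x_n .
\]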

8 Chap 1. Linear Regression
More is implied: we can also estimate other parameters, e.g. the variance. We can use other estimators, such as the unbiased one (compared below). We can give interval (range) estimates rather than point estimates.
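A small sketch comparing the ML (biased) and unbiased variance estimators, assuming numpy and hypothetical measurements:

import numpy as np

data = np.array([8848.1, 8849.0, 8847.5, 8848.9, 8848.3])  # hypothetical measurements

var_ml = np.var(data)                 # ML estimate: divides by N (biased)
var_unbiased = np.var(data, ddof=1)   # divides by N - 1 (unbiased)
print(var_ml, var_unbiased)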

9 Chap 1. Linear Regression
Correlated variables
The height of Mount Qomolangma is correlated with the season (Spring, Summer, Fall, Winter). So what is the correlation between two variables? Why not an affine function:

def UseSeasonalHeight(x, a, b):
    return a * x + b

10 Chap 1. Linear Regression
Least squares: we formulate the optimization problem as minimizing the sum of squared errors over the slope and intercept, and (fortunately) it has a closed-form solution (sketched below). Result ↗ Seemingly not good; how to improve?
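A sketch of the closed-form least-squares fit of the affine model y ≈ a·x + b, assuming numpy and hypothetical season/height data:

import numpy as np

# Hypothetical data: x encodes the season, y the measured height
x = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
y = np.array([8848.2, 8848.9, 8848.4, 8847.8, 8848.1, 8849.0, 8848.3, 8847.7])

# Closed-form solution of min_{a,b} sum_n (y_n - (a*x_n + b))^2
X = np.stack([x, np.ones_like(x)], axis=1)   # design matrix with columns [x, 1]
a, b = np.linalg.lstsq(X, y, rcond=None)[0]  # solves the normal equations
print(a, b)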

11 Chap 1. Linear Regression
Variable (re)mapping
Previously we used the original season coding; now we use a remapped one. Result ↗
Season: Spring = 1, Summer = 2, Fall = 3, Winter = 4
Remapped season: Summer = 1, Spring/Fall = 2, Winter = 3

def ErrorOfHeight(datax, datay, a, b):
    # datax, datay: numpy arrays of (remapped) seasons and measured heights
    fity = UseSeasonalHeight(datax, a, b)
    error = datay - fity
    return sum(error**2)

12 From the statistical perspective
We have two random variables:
Height: a dependent, continuous variable
Season: an independent, discrete variable
The season's probability distribution, the height's probability distribution, and the overall likelihood function.

13 Chap 1. Linear Regression
History review: Adrien-Marie Legendre (French, 1752–1833), Carl Friedrich Gauss (German, 1777–1855).

14 Chap 1. Linear Regression
Notes: Correlation is not causation, but it inspires efforts at interpretation. Remapped/latent variables are important.

15 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

16 As we are not confident about our data
Height is correlated with the season, but also with other variables. Can we constrain the level of correlation between height and season? So we want to constrain the slope parameter. We have two choices:
Given a range of possible values of the slope parameter, find the least-squares solution
Minimize the squared error and the (e.g. squared) slope parameter simultaneously

17 The Lagrange multiplier
Constraint form, unconstrained form, and solution. [Fitted lines Reg 0 through Reg 10, with their coefficients and errors for increasing regularization weight, were shown here.]
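A minimal sketch of the unconstrained (penalized) form, penalizing the squared slope with weight lam; data values and lam values are hypothetical:

import numpy as np

def ridge_slope_fit(x, y, lam):
    # Penalized least squares: min_{a,b} sum (y - (a*x + b))**2 + lam * a**2
    # Setting the gradient to zero gives a 2x2 linear system.
    A = np.array([[np.sum(x * x) + lam, np.sum(x)],
                  [np.sum(x),           len(x)  ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    a, b = np.linalg.solve(A, rhs)
    return a, b

x = np.array([1, 2, 3, 1, 2, 3], dtype=float)
y = np.array([8848.9, 8848.4, 8847.9, 8849.0, 8848.5, 8847.8])
for lam in [0.0, 1.0, 10.0]:          # larger lam shrinks the slope a toward 0
    print(lam, ridge_slope_fit(x, y, lam))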

18 More about the Lagrange multiplier method 1/2
Example: no constraint; with an equality constraint; with an inequality constraint.

19 More about the Lagrange multiplier method 2/2
A necessary (but not sufficient) condition for convex optimization, obtained using the Lagrange multiplier method: the KKT conditions (stated below).
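A sketch of the standard KKT conditions for an inequality-constrained problem (the slide's own equations are not preserved; the notation below is assumed):

\[
\min_{x} f(x)\ \ \text{s.t.}\ g(x) \le 0, \qquad
L(x,\lambda) = f(x) + \lambda g(x),
\]
\[
\text{KKT:}\quad \nabla_x L(x^\star,\lambda^\star) = 0,\quad
g(x^\star) \le 0,\quad \lambda^\star \ge 0,\quad \lambda^\star g(x^\star) = 0 .
\]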

20 What & Why is regularization?
A process of introducing additional information in order to solve an ill-posed problem.
Why: we want to introduce additional information, or we have difficulty solving the ill-posed problem.
Without regularization vs. with regularization.

21 From the statistical perspective
The Bayes formula. Maximum a posteriori (MAP) estimation (Bayesian estimation): we need to specify a prior, e.g. a Gaussian. Finally it reduces to the regularized least squares (a sketch of the reduction is given below).
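A sketch of the reduction with a zero-mean Gaussian prior, in standard (assumed) notation:

\[
p(\mathbf{w}\mid\mathcal{D}) \propto p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w}),\qquad
p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid 0, \alpha^{-1}\mathbf{I}),\quad
p(\mathcal{D}\mid\mathbf{w}) = \prod_n \mathcal{N}\big(t_n \mid y(x_n,\mathbf{w}), \beta^{-1}\big),
\]
\[
\mathbf{w}_{\mathrm{MAP}} = \arg\min_{\mathbf{w}}\ \frac{\beta}{2}\sum_n\big(t_n - y(x_n,\mathbf{w})\big)^2 + \frac{\alpha}{2}\|\mathbf{w}\|^2,
\qquad \lambda = \alpha/\beta .
\]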

22 Bayesian interpretation of regularization
The prior is "additional information"; many statisticians question this point. How much regularization to apply depends on:
How confident we are about the data
How confident we are about the prior

23 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

24 Polynomial curve fitting
The model is a weighted sum of basis functions; another form separates the weights and a bias term (sketched below).
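A sketch of the elided model equation in standard notation (the symbols φ_j, w_j, w_0 are assumed):

\[
y(x,\mathbf{w}) = \sum_{j=1}^{M} w_j\,\phi_j(x) = \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x),
\qquad
\text{another form:}\quad y(x,\mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j\,\phi_j(x),
\]
where the w_j are the weights, w_0 is the bias, and for polynomial curve fitting φ_j(x) = x^j.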

25 Chap 1. Linear Regression
Basis functions: global vs. local
Polynomial (global)
Gaussian (local)
Sigmoid
Other choices: Fourier basis (sinusoidal), wavelet, spline
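A small sketch of the three basis-function families listed above (parameter names mu and s are illustrative):

import numpy as np

def poly_basis(x, j):
    return x ** j                                    # polynomial: global

def gauss_basis(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))     # Gaussian: local

def sigmoid_basis(x, mu, s):
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))       # sigmoidal

x = np.linspace(-1, 1, 5)
print(poly_basis(x, 2), gauss_basis(x, 0.0, 0.3), sigmoid_basis(x, 0.0, 0.1))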

26 Chap 1. Linear Regression
Variable remapping: using basis functions remaps the variable(s) in a non-linear manner and changes the dimensionality, to enable a simpler (linear) model.

27 Chap 1. Linear Regression
Maximum likelihood: assume the observations come from a deterministic function with additive Gaussian noise. Given the observed inputs and targets, the likelihood function is a product of Gaussian densities (sketched below).
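A sketch of the elided model and likelihood in standard (assumed) notation:

\[
t = y(x,\mathbf{w}) + \epsilon,\quad \epsilon \sim \mathcal{N}(0, \beta^{-1}),
\qquad
p(\mathbf{t}\mid \mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}\big(t_n \mid \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n),\ \beta^{-1}\big).
\]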

28 Maximum likelihood and least squares
Maximizing the likelihood is equivalent to minimizing the sum of squared errors (SSE).

29 Maximum likelihood solution
The solution applies the (Moore-Penrose) pseudo-inverse of the design matrix to the target vector (sketched below).
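A sketch of the maximum-likelihood solution via the pseudo-inverse, assuming numpy, Gaussian basis functions, and hypothetical data:

import numpy as np

def design_matrix(x, centers, s):
    # Gaussian basis functions plus a constant (bias) column
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)   # hypothetical targets
Phi = design_matrix(x, np.linspace(0, 1, 9), 0.1)

w_ml = np.linalg.pinv(Phi) @ t   # ML solution via the Moore-Penrose pseudo-inverse
print(w_ml.shape)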

30 Geometrical interpretation
Let t be the target vector, and let the columns of the design matrix be the basis-function vectors; they span a subspace. Then the fitted vector is the orthogonal projection of t onto that subspace, so as to minimize the Euclidean distance.

31 Regularized least squares
Construct the "joint" error function: data term + regularization term. Use SSE as the data term and a quadratic regularization term (ridge regression); the solution is again in closed form (sketched below).
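A sketch of the ridge objective and its closed-form solution in standard (assumed) notation:

\[
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n)\big)^2 + \frac{\lambda}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w},
\qquad
\mathbf{w} = \big(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}.
\]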

32 Chap 1. Linear Regression
Equivalent kernel: for a new input, the predicted output is a weighted sum of the training targets. Predictions can be calculated directly from the equivalent kernel, without calculating the parameters (sketched below).
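A sketch of the equivalent kernel, written here for the regularized least-squares solution above (notation assumed):

\[
y(x) = \boldsymbol{\phi}(x)^{\mathsf T}\big(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}
     = \sum_{n=1}^{N} k(x, x_n)\, t_n,
\qquad
k(x, x') = \boldsymbol{\phi}(x)^{\mathsf T}\big(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\phi}(x').
\]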

33 Equivalent kernel for Gaussian basis functions

34 Equivalent kernel for other basis functions
Polynomial and sigmoidal basis functions (figures). The equivalent kernel is "local": nearby points carry more weight.

35 Properties of equivalent kernel
Sums to 1 if λ is 0
May have negative values
Can be seen as an inner product

36 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

37 Chap 1. Linear Regression
Example (reproduced from PRML): generate 100 data sets, each having 25 points, from a sine function plus Gaussian noise. Perform ridge regression on each data set with 24 Gaussian basis functions and different values of the regularization weight.
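A minimal sketch of this simulation (noise level, basis widths, and the regularization weight lam are assumptions, not the values used in PRML):

import numpy as np

rng = np.random.default_rng(0)
centers, s, lam = np.linspace(0, 1, 24), 0.1, 1.0   # 24 Gaussian basis functions; lam is illustrative

def phi(x):
    # Gaussian basis functions plus a bias column
    P = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), P])

x = np.linspace(0, 1, 25)
fits = []
for _ in range(100):                                 # 100 data sets, 25 points each
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(25)
    P = phi(x)
    w = np.linalg.solve(lam * np.eye(P.shape[1]) + P.T @ P, P.T @ t)   # ridge regression
    fits.append(P @ w)

avg_fit = np.mean(fits, axis=0)                      # average curve over the 100 fits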

38 Chap 1. Linear Regression
Simulation results 1/3: high regularization; the variance is small but the bias is large. Fitted curves (20 fits shown) and the average curve over the 100 fits.

39 Chap 1. Linear Regression
Simulation results 2/3: moderate regularization. Fitted curves (20 fits shown) and the average curve over the 100 fits.

40 Chap 1. Linear Regression
Simulation results 3/3: low regularization; the variance is large but the bias is small. Fitted curves (20 fits shown) and the average curve over the 100 fits.

41 Bias-variance decomposition
The second term is intrinsic "noise"; consider the first term. Suppose we have a dataset and calculate the parameters based on it; then take the expectation with respect to the dataset. Finally we have: expected "loss" = (bias)² + variance + noise (sketched below).
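A sketch of the decomposition in standard (assumed) notation, where h(x) = E[t | x] is the optimal prediction and D denotes the dataset:

\[
\mathbb{E}\big[(y(x;\mathcal{D}) - t)^2\big]
 = \underbrace{\big(\mathbb{E}_{\mathcal{D}}[y(x;\mathcal{D})] - h(x)\big)^2}_{(\text{bias})^2}
 + \underbrace{\mathbb{E}_{\mathcal{D}}\Big[\big(y(x;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x;\mathcal{D})]\big)^2\Big]}_{\text{variance}}
 + \underbrace{\mathbb{E}\big[(h(x) - t)^2\big]}_{\text{noise}}.
\]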

42 Bias-variance trade-off
An over-regularized model will have high bias, while an under-regularized model will have high variance. How can we achieve the trade-off? For example, by cross validation (to be discussed later).

43 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

44 Chap 1. Linear Regression
Other forms? Least squares, ridge regression, and general norm-regularized regression with an Lq norm (sketched below).
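A sketch of the Lq-norm regularized objective in standard (assumed) notation:

\[
\min_{\mathbf{w}}\ \sum_{n=1}^{N}\big(t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n)\big)^2 + \lambda \sum_j |w_j|^{q},
\qquad
\|\mathbf{w}\|_q = \Big(\sum_j |w_j|^q\Big)^{1/q}
\quad (q = 2:\ \text{ridge};\ q = 1:\ \text{LASSO}).
\]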

45 Chap 1. Linear Regression
Different norms: what about q = 0 and q = ∞?

46 Chap 1. Linear Regression
Best subset selection: define the l0 "norm" as the number of nonzero coefficients. Best subset selection regression penalizes (or constrains) this count; the resulting solution is "sparse". Unfortunately, this problem is NP-hard.

47 Example: Why we need sparsity?
fMRI data help us understand the brain's functionality. Brain fMRI data may consist of 10,000–100,000 voxels. We want to identify the most relevant anchor points.

48 L1 norm replacing l0 “norm”
Interestingly, we can use the L1 norm to replace the l0 "norm" and still achieve a sparse solution.* Geometric interpretation (left: L1 norm; right: L2 norm): the least-squares solution meets the L1 ball at a corner, and that is where sparsity arises. * Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6).

49 Chap 1. Linear Regression
LASSO regression. LASSO: least absolute shrinkage and selection operator.* Bayesian interpretation: the Laplace distribution as the prior. * Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1).

50 Solution of LASSO regression
Consider a special case: the least-squares solution is known in closed form, and the LASSO solution is obtained from it by soft thresholding (compared on the next slide).

51 Comparison between best subset, LASSO, and ridge
Consider the special case of an orthonormal design matrix:
Best subset: hard thresholding of the LS solution
Ridge: uniformly shrink the LS solution
LASSO: soft thresholding of the LS solution
A sketch of the three operations is given below.
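A small sketch of the three per-coordinate operations for an orthonormal design (the exact threshold level depends on how the penalty is parameterized; lam here is illustrative):

import numpy as np

def hard_threshold(w, lam):      # best subset (l0): keep or kill each coordinate
    return np.where(np.abs(w) > lam, w, 0.0)

def soft_threshold(w, lam):      # LASSO (l1): shrink toward zero, then kill small ones
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ridge_shrink(w, lam):        # ridge (l2): uniform shrinkage, nothing becomes exactly zero
    return w / (1.0 + lam)

w_ls = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])   # hypothetical LS solution
print(hard_threshold(w_ls, 0.5), soft_threshold(w_ls, 0.5), ridge_shrink(w_ls, 0.5))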

52 Implications of different norms
From q = 0 to q = ∞: best subset (q = 0), LASSO (q = 1), ridge (q = 2). Sparse solutions are obtained for q ≤ 1; the optimization is convex for q ≥ 1.

53 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

54 Bayesian linear regression
Define a (Gaussian) prior for the parameters. Note the likelihood function is Gaussian, so the posterior is also Gaussian (sketched below).
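A sketch of the standard Gaussian prior/likelihood/posterior equations (notation assumed, following the usual PRML-style formulation):

\[
p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_0,\mathbf{S}_0),
\qquad
p(\mathbf{t}\mid\mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}\big(t_n \mid \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n),\ \beta^{-1}\big),
\]
\[
p(\mathbf{w}\mid\mathbf{t}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_N,\mathbf{S}_N),
\quad
\mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}\big),
\quad
\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}.
\]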

55 Chap 1. Linear Regression
MAP estimation: the maximum a posteriori (MAP) estimate, compared with the maximum likelihood estimate and with the ridge regression solution.

56 Chap 1. Linear Regression
How to set the prior: if using a zero-mean Gaussian prior, the MAP (Bayesian) estimate is equivalent to the ridge regression solution; if using a zero-mean Laplace prior, the MAP estimate (no closed-form expression) is equivalent to the LASSO regression solution. Conjugate prior: chosen so that the posterior and the prior follow the same family of distributions, e.g. Gaussian.

57 Chap 1. Linear Regression
Example (reproduced from PRML): 0 data points observed. Parameters for simulation:

58 Chap 1. Linear Regression
Simulation results 1/3: 1 data point observed. Panels: likelihood, posterior, data space.

59 Chap 1. Linear Regression
Simulation results 2/3: 2 data points observed. Panels: likelihood, posterior, data space.

60 Chap 1. Linear Regression
Simulation results 3/3: 20 data points observed. Panels: posterior, data space. The variance of the posterior decreases as the number of data points increases.

61 Predictive distribution
In the Bayesian framework, every variable has a distribution, including the predicted output given a new input. The predictive variance contains a term that vanishes as N increases (sketched below).
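A sketch of the standard predictive distribution for Bayesian linear regression (notation assumed, using the posterior m_N, S_N above):

\[
p(t\mid x,\mathbf{t}) = \int p(t\mid x,\mathbf{w})\,p(\mathbf{w}\mid\mathbf{t})\,d\mathbf{w}
 = \mathcal{N}\big(t \mid \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(x),\ \sigma_N^2(x)\big),
\qquad
\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^{\mathsf T}\mathbf{S}_N\boldsymbol{\phi}(x),
\]
where the first term is the observation noise and the second term vanishes as N increases.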

62 Chap 1. Linear Regression
Example (reproduced from PRML): sinusoidal data, 9 Gaussian basis functions, 1 data point. Shown: the true function, the predictive mean, the predictive variance, and different predicted functions.

63 Chap 1. Linear Regression
Simulation results 1/3: sinusoidal data, 9 Gaussian basis functions, 2 data points.

64 Chap 1. Linear Regression
Simulation results 2/3: sinusoidal data, 9 Gaussian basis functions, 4 data points.

65 Chap 1. Linear Regression
Simulation results 3/3: sinusoidal data, 9 Gaussian basis functions, 25 data points.

66 Chap 1. Linear Regression
Model selection: in polynomial curve fitting, how do we set the order?

67 From the statistical perspective
Bayesian model selection: given the dataset, estimate the posterior probability of the different models. "ML" model selection: choose the model that maximizes the model evidence function (sketched below).
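A sketch of the standard model-posterior and model-evidence expressions (notation assumed):

\[
p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D}\mid\mathcal{M}_i)\,p(\mathcal{M}_i),
\qquad
p(\mathcal{D}\mid\mathcal{M}_i) = \int p(\mathcal{D}\mid\mathbf{w},\mathcal{M}_i)\,p(\mathbf{w}\mid\mathcal{M}_i)\,d\mathbf{w}.
\]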

68 Calculation of model evidence for Bayesian linear regression
Details: cf. PRML.

69 Chap 1. Linear Regression
Example (reproduced from PRML).

70 More about the hyper-parameters
We can "estimate" the hyper-parameters based on, e.g., the ML (evidence maximization) criterion. Define the eigenvalues of βΦᵀΦ as λ_i (the re-estimation equations are sketched below).
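A sketch of the re-estimation equations, following PRML's evidence approximation (notation assumed, with λ_i the eigenvalues of βΦᵀΦ and m_N the posterior mean):

\[
\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i},
\qquad
\alpha = \frac{\gamma}{\mathbf{m}_N^{\mathsf T}\mathbf{m}_N},
\qquad
\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\big(t_n - \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(x_n)\big)^2.
\]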

71 Chap 1. Linear Regression
Interpretation 1/3: the eigenvalues of βΦᵀΦ are the λ_i defined above (figure).

72 Chap 1. Linear Regression
Interpretation 2/3: by decreasing α, more parameters become "learnt from data"; γ measures the number of "learnt" parameters.

73 Chap 1. Linear Regression
Interpretation 3/3: recall that for estimating Gaussian parameters, the unbiased variance estimator divides by N − 1 rather than N; analogously, for Bayesian linear regression the noise-variance estimate divides by N − γ.

74 Chap 1. Linear Regression
Notes The hyper-parameters can be further regarded as random variables, and integrated into the Bayesian framework 2018/11/10 Chap 1. Linear Regression

75 Chap 1. Linear Regression
Chapter summary
Dictionary: bias-variance decomposition, equivalent kernel, Gaussian distribution, Laplace distribution, KKT condition, model selection, prior (conjugate prior), posterior, regularization, sparsity
Toolbox: basis functions, best subset selection, Lagrange multiplier, LASSO regression, least squares, MAP (Bayesian) estimation, ML estimation, ridge regression

