
1 Statistical Learning Dong Liu Dept. EEIS, USTC

2 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

3 Chap 1. Linear Regression
A motivating example 1/2
What is the height of Mount Qomolangma? A piece of knowledge about one variable. How do we achieve this "knowledge" from data? We have a series of measurements; for example, we can use the (arithmetic) mean:

def HeightOfQomolangma():
    return 8848.86  # hard-coded "knowledge"; the constant here is an assumed value in meters

def SLHeightOfQomolangma(data):
    return sum(data) / len(data)  # "statistical learning": the arithmetic mean of the measurements

4 Chap 1. Linear Regression
A motivating example 2/2
Or in this way, separating learning from using:

hQomo = 0

# Learning
def LearnHeightOfQomolangma(data):
    global hQomo
    hQomo = sum(data) / len(data)

# Using
def UseHeightOfQomolangma():
    return hQomo

5 Chap 1. Linear Regression
Why arithmetic mean? Least squares: the mean is the value that minimizes the sum of squared deviations, $\hat h = \arg\min_h \sum_n (x_n - h)^2$. In statistical learning, we often formulate such optimization problems and try to solve them. How to formulate? How to solve?
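A minimal numerical check (the data values and candidate grid are hypothetical): the minimizer of the sum of squared deviations coincides with the arithmetic mean.

data = [8848.1, 8849.0, 8847.5, 8848.9, 8848.3]  # hypothetical measurements

def sse(h, data):
    # sum of squared deviations from a candidate height h
    return sum((x - h) ** 2 for x in data)

candidates = [8846 + 0.01 * i for i in range(400)]          # brute-force grid of candidate heights
best = min(candidates, key=lambda h: sse(h, data))
print(best, sum(data) / len(data))                          # both are (close to) the arithmetic mean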

6 From the statistical perspective
The height of Qomolangma is a random variable, which follows a specific probability distribution, for example a Gaussian (normal) distribution. The measurements are observations of the random variable and are used to estimate the distribution. Assumption: the observations are independent and identically distributed (i.i.d.).

7 Maximum likelihood estimation
Likelihood function: the probability (density) of an observation, viewed as a function of the parameter. Overall likelihood function (recall i.i.d.): the product of the per-observation likelihoods. We need to find the parameter that maximizes the overall likelihood, and it reduces to least squares (a sketch of the reduction is given below).
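A sketch of the elided derivation, assuming i.i.d. Gaussian measurements $x_n \sim \mathcal{N}(h, \sigma^2)$:

\[
p(x_1,\dots,x_N \mid h, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Big(-\frac{(x_n-h)^2}{2\sigma^2}\Big),
\qquad
\hat h_{\mathrm{ML}} = \arg\max_h\, p(\mathbf{x}\mid h,\sigma^2)
= \arg\min_h \sum_{n=1}^{N}(x_n-h)^2 = \frac{1}{N}\sum_{n=1}^{N} x_n .
\]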

8 Chap 1. Linear Regression
More is implied: we can also estimate other parameters, e.g. the variance. We can use other estimators, such as the unbiased one (compared below). We can give interval (range) estimates rather than point estimates.
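A small sketch comparing the ML (biased) and unbiased variance estimators, assuming numpy and hypothetical measurements:

import numpy as np

data = np.array([8848.1, 8849.0, 8847.5, 8848.9, 8848.3])  # hypothetical measurements

var_ml = np.var(data)                 # ML estimate: divides by N (biased)
var_unbiased = np.var(data, ddof=1)   # divides by N - 1 (unbiased)
print(var_ml, var_unbiased)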

9 Chap 1. Linear Regression
Correlated variables
The height of Mount Qomolangma is correlated with the season (Spring, Summer, Fall, Winter). So what is the correlation between two variables? Why not an affine function:

def UseSeasonalHeight(x, a, b):
    return a * x + b

10 Chap 1. Linear Regression
Least squares: we formulate the optimization problem as minimizing the sum of squared errors over the slope and intercept, and (fortunately) it has a closed-form solution (sketched below). Result ↗ Seemingly not good; how to improve?
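A sketch of the closed-form least-squares fit of the affine model y ≈ a·x + b, assuming numpy and hypothetical season/height data:

import numpy as np

# Hypothetical data: x encodes the season, y the measured height
x = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
y = np.array([8848.2, 8848.9, 8848.4, 8847.8, 8848.1, 8849.0, 8848.3, 8847.7])

# Closed-form solution of min_{a,b} sum_n (y_n - (a*x_n + b))^2
X = np.stack([x, np.ones_like(x)], axis=1)   # design matrix with columns [x, 1]
a, b = np.linalg.lstsq(X, y, rcond=None)[0]  # solves the normal equations
print(a, b)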

11 Chap 1. Linear Regression
Variable (re)mapping
Previously we used the original season coding; now we use a remapped one. Result ↗
Season: Spring = 1, Summer = 2, Fall = 3, Winter = 4
Remapped season: Summer = 1, Spring/Fall = 2, Winter = 3

def ErrorOfHeight(datax, datay, a, b):
    # datax, datay: numpy arrays of (remapped) seasons and measured heights
    fity = UseSeasonalHeight(datax, a, b)
    error = datay - fity
    return sum(error**2)

12 From the statistical perspective
We have two random variables:
Height: a dependent, continuous variable
Season: an independent, discrete variable
The season's probability distribution, the height's probability distribution, and the overall likelihood function.

13 Chap 1. Linear Regression
History review: Adrien-Marie Legendre (French, 1752–1833), Carl Friedrich Gauss (German, 1777–1855).

14 Chap 1. Linear Regression
Notes: Correlation is not causation, but it inspires efforts at interpretation. Remapped/latent variables are important.

15 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

16 As we are not confident about our data
Height is correlated with the season, but also with other variables. Can we constrain the level of correlation between height and season? So we want to constrain the slope parameter. We have two choices:
Given a range of possible values of the slope parameter, find the least-squares solution
Minimize the squared error and the (e.g. squared) slope parameter simultaneously

17 The Lagrange multiplier
Constraint form, unconstrained form, and solution. [Fitted lines Reg 0 through Reg 10, with their coefficients and errors for increasing regularization weight, were shown here.]
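A minimal sketch of the unconstrained (penalized) form, penalizing the squared slope with weight lam; data values and lam values are hypothetical:

import numpy as np

def ridge_slope_fit(x, y, lam):
    # Penalized least squares: min_{a,b} sum (y - (a*x + b))**2 + lam * a**2
    # Setting the gradient to zero gives a 2x2 linear system.
    A = np.array([[np.sum(x * x) + lam, np.sum(x)],
                  [np.sum(x),           len(x)  ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    a, b = np.linalg.solve(A, rhs)
    return a, b

x = np.array([1, 2, 3, 1, 2, 3], dtype=float)
y = np.array([8848.9, 8848.4, 8847.9, 8849.0, 8848.5, 8847.8])
for lam in [0.0, 1.0, 10.0]:          # larger lam shrinks the slope a toward 0
    print(lam, ridge_slope_fit(x, y, lam))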

18 More about the Lagrange multiplier method 1/2
Example: no constraint; with an equality constraint; with an inequality constraint.

19 More about the Lagrange multiplier method 2/2
A necessary (but not sufficient) condition for convex optimization, obtained using the Lagrange multiplier method: the KKT conditions (stated below).
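A sketch of the standard KKT conditions for an inequality-constrained problem (the slide's own equations are not preserved; the notation below is assumed):

\[
\min_{x} f(x)\ \ \text{s.t.}\ g(x) \le 0, \qquad
L(x,\lambda) = f(x) + \lambda g(x),
\]
\[
\text{KKT:}\quad \nabla_x L(x^\star,\lambda^\star) = 0,\quad
g(x^\star) \le 0,\quad \lambda^\star \ge 0,\quad \lambda^\star g(x^\star) = 0 .
\]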

20 What & Why is regularization?
A process of introducing additional information in order to solve an ill-posed problem.
Why: we want to introduce additional information, or we have difficulty solving the ill-posed problem.
Without regularization vs. with regularization.

21 From the statistical perspective
The Bayes formula. Maximum a posteriori (MAP) estimation (Bayesian estimation): we need to specify a prior, e.g. a Gaussian. Finally it reduces to the regularized least squares (a sketch of the reduction is given below).
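A sketch of the reduction with a zero-mean Gaussian prior, in standard (assumed) notation:

\[
p(\mathbf{w}\mid\mathcal{D}) \propto p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w}),\qquad
p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid 0, \alpha^{-1}\mathbf{I}),\quad
p(\mathcal{D}\mid\mathbf{w}) = \prod_n \mathcal{N}\big(t_n \mid y(x_n,\mathbf{w}), \beta^{-1}\big),
\]
\[
\mathbf{w}_{\mathrm{MAP}} = \arg\min_{\mathbf{w}}\ \frac{\beta}{2}\sum_n\big(t_n - y(x_n,\mathbf{w})\big)^2 + \frac{\alpha}{2}\|\mathbf{w}\|^2,
\qquad \lambda = \alpha/\beta .
\]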

22 Bayesian interpretation of regularization
The prior is "additional information"; many statisticians question this point. How much regularization to apply depends on:
How confident we are about the data
How confident we are about the prior

23 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

24 Polynomial curve fitting
The model is a weighted sum of basis functions; another form separates the weights and a bias term (sketched below).
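A sketch of the elided model equation in standard notation (the symbols φ_j, w_j, w_0 are assumed):

\[
y(x,\mathbf{w}) = \sum_{j=1}^{M} w_j\,\phi_j(x) = \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x),
\qquad
\text{another form:}\quad y(x,\mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j\,\phi_j(x),
\]
where the w_j are the weights, w_0 is the bias, and for polynomial curve fitting φ_j(x) = x^j.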

25 Chap 1. Linear Regression
Basis functions: global vs. local
Polynomial (global)
Gaussian (local)
Sigmoid
Other choices: Fourier basis (sinusoidal), wavelet, spline
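A small sketch of the three basis-function families listed above (parameter names mu and s are illustrative):

import numpy as np

def poly_basis(x, j):
    return x ** j                                    # polynomial: global

def gauss_basis(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))     # Gaussian: local

def sigmoid_basis(x, mu, s):
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))       # sigmoidal

x = np.linspace(-1, 1, 5)
print(poly_basis(x, 2), gauss_basis(x, 0.0, 0.3), sigmoid_basis(x, 0.0, 0.1))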

26 Chap 1. Linear Regression
Variable remapping: using basis functions remaps the variable(s) in a non-linear manner and changes the dimensionality, to enable a simpler (linear) model.

27 Chap 1. Linear Regression
Maximum likelihood: assume the observations come from a deterministic function with additive Gaussian noise. Given the observed inputs and targets, the likelihood function is a product of Gaussian densities (sketched below).
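A sketch of the elided model and likelihood in standard (assumed) notation:

\[
t = y(x,\mathbf{w}) + \epsilon,\quad \epsilon \sim \mathcal{N}(0, \beta^{-1}),
\qquad
p(\mathbf{t}\mid \mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}\big(t_n \mid \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n),\ \beta^{-1}\big).
\]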

28 Maximum likelihood and least squares
Maximizing the likelihood is equivalent to minimizing the sum of squared errors (SSE).

29 Maximum likelihood solution
The solution applies the (Moore-Penrose) pseudo-inverse of the design matrix to the target vector (sketched below).
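A sketch of the maximum-likelihood solution via the pseudo-inverse, assuming numpy, Gaussian basis functions, and hypothetical data:

import numpy as np

def design_matrix(x, centers, s):
    # Gaussian basis functions plus a constant (bias) column
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)   # hypothetical targets
Phi = design_matrix(x, np.linspace(0, 1, 9), 0.1)

w_ml = np.linalg.pinv(Phi) @ t   # ML solution via the Moore-Penrose pseudo-inverse
print(w_ml.shape)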

30 Geometrical interpretation
Let t be the target vector, and let the columns of the design matrix be the basis-function vectors; they span a subspace. Then the fitted vector is the orthogonal projection of t onto that subspace, so as to minimize the Euclidean distance.

31 Regularized least squares
Construct the "joint" error function: data term + regularization term. Use SSE as the data term and a quadratic regularization term (ridge regression); the solution is again in closed form (sketched below).
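A sketch of the ridge objective and its closed-form solution in standard (assumed) notation:

\[
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n)\big)^2 + \frac{\lambda}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w},
\qquad
\mathbf{w} = \big(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}.
\]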

32 Chap 1. Linear Regression
Equivalent kernel: for a new input, the predicted output is a weighted sum of the training targets. Predictions can be calculated directly from the equivalent kernel, without calculating the parameters (sketched below).
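A sketch of the equivalent kernel, written here for the regularized least-squares solution above (notation assumed):

\[
y(x) = \boldsymbol{\phi}(x)^{\mathsf T}\big(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}
     = \sum_{n=1}^{N} k(x, x_n)\, t_n,
\qquad
k(x, x') = \boldsymbol{\phi}(x)^{\mathsf T}\big(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\phi}(x').
\]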

33 Equivalent kernel for Gaussian basis functions

34 Equivalent kernel for other basis functions
Polynomial and sigmoidal basis functions (figures). The equivalent kernel is "local": nearby points carry more weight.

35 Properties of equivalent kernel
Sums to 1 if λ is 0
May have negative values
Can be seen as an inner product

36 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

37 Chap 1. Linear Regression
Example (reproduced from PRML): generate 100 data sets, each having 25 points, from a sine function plus Gaussian noise. Perform ridge regression on each data set with 24 Gaussian basis functions and different values of the regularization weight.
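A minimal sketch of this simulation (noise level, basis widths, and the regularization weight lam are assumptions, not the values used in PRML):

import numpy as np

rng = np.random.default_rng(0)
centers, s, lam = np.linspace(0, 1, 24), 0.1, 1.0   # 24 Gaussian basis functions; lam is illustrative

def phi(x):
    # Gaussian basis functions plus a bias column
    P = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), P])

x = np.linspace(0, 1, 25)
fits = []
for _ in range(100):                                 # 100 data sets, 25 points each
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(25)
    P = phi(x)
    w = np.linalg.solve(lam * np.eye(P.shape[1]) + P.T @ P, P.T @ t)   # ridge regression
    fits.append(P @ w)

avg_fit = np.mean(fits, axis=0)                      # average curve over the 100 fits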

38 Chap 1. Linear Regression
Simulation results 1/3: high regularization; the variance is small but the bias is large. Fitted curves (20 fits shown) and the average curve over the 100 fits.

39 Chap 1. Linear Regression
Simulation results 2/3: moderate regularization. Fitted curves (20 fits shown) and the average curve over the 100 fits.

40 Chap 1. Linear Regression
Simulation results 3/3: low regularization; the variance is large but the bias is small. Fitted curves (20 fits shown) and the average curve over the 100 fits.

41 Bias-variance decomposition
The second term is intrinsic "noise"; consider the first term. Suppose we have a dataset and calculate the parameters based on it; then take the expectation with respect to the dataset. Finally we have: expected "loss" = (bias)² + variance + noise (sketched below).
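A sketch of the decomposition in standard (assumed) notation, where h(x) = E[t | x] is the optimal prediction and D denotes the dataset:

\[
\mathbb{E}\big[(y(x;\mathcal{D}) - t)^2\big]
 = \underbrace{\big(\mathbb{E}_{\mathcal{D}}[y(x;\mathcal{D})] - h(x)\big)^2}_{(\text{bias})^2}
 + \underbrace{\mathbb{E}_{\mathcal{D}}\Big[\big(y(x;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x;\mathcal{D})]\big)^2\Big]}_{\text{variance}}
 + \underbrace{\mathbb{E}\big[(h(x) - t)^2\big]}_{\text{noise}}.
\]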

42 Bias-variance trade-off
An over-regularized model will have high bias, while an under-regularized model will have high variance. How can we achieve the trade-off? For example, by cross validation (to be discussed later).

43 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

44 Chap 1. Linear Regression
Other forms? Least squares, ridge regression, and general norm-regularized regression with an Lq norm (sketched below).
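A sketch of the Lq-norm regularized objective in standard (assumed) notation:

\[
\min_{\mathbf{w}}\ \sum_{n=1}^{N}\big(t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n)\big)^2 + \lambda \sum_j |w_j|^{q},
\qquad
\|\mathbf{w}\|_q = \Big(\sum_j |w_j|^q\Big)^{1/q}
\quad (q = 2:\ \text{ridge};\ q = 1:\ \text{LASSO}).
\]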

45 Chap 1. Linear Regression
Different norms: what about q = 0 and q = ∞?

46 Chap 1. Linear Regression
Best subset selection: define the l0 "norm" as the number of nonzero coefficients. Best subset selection regression penalizes (or constrains) this count; the resulting solution is "sparse". Unfortunately, this problem is NP-hard.

47 Example: Why we need sparsity?
fMRI data help us understand the brain's functionality. Brain fMRI data may consist of 10,000–100,000 voxels. We want to identify the most relevant anchor points.

48 L1 norm replacing l0 “norm”
Interestingly, we can use the L1 norm to replace the l0 "norm" and still achieve a sparse solution.* Geometric interpretation (left: L1 norm; right: L2 norm): the least-squares solution meets the L1 ball at a corner, and that is where sparsity arises. * Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6).

49 Chap 1. Linear Regression
LASSO regression. LASSO: least absolute shrinkage and selection operator.* Bayesian interpretation: the Laplace distribution as the prior. * Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1).

50 Solution of LASSO regression
Consider a special case: the least-squares solution is known in closed form, and the LASSO solution is obtained from it by soft thresholding (compared on the next slide).

51 Comparison between best subset, LASSO, and ridge
Consider the special case of an orthonormal design matrix:
Best subset: hard thresholding of the LS solution
Ridge: uniformly shrink the LS solution
LASSO: soft thresholding of the LS solution
A sketch of the three operations is given below.
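A small sketch of the three per-coordinate operations for an orthonormal design (the exact threshold level depends on how the penalty is parameterized; lam here is illustrative):

import numpy as np

def hard_threshold(w, lam):      # best subset (l0): keep or kill each coordinate
    return np.where(np.abs(w) > lam, w, 0.0)

def soft_threshold(w, lam):      # LASSO (l1): shrink toward zero, then kill small ones
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ridge_shrink(w, lam):        # ridge (l2): uniform shrinkage, nothing becomes exactly zero
    return w / (1.0 + lam)

w_ls = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])   # hypothetical LS solution
print(hard_threshold(w_ls, 0.5), soft_threshold(w_ls, 0.5), ridge_shrink(w_ls, 0.5))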

52 Implications of different norms
From q = 0 to q = ∞: best subset (q = 0), LASSO (q = 1), ridge (q = 2). Sparse solutions are obtained for q ≤ 1; the optimization is convex for q ≥ 1.

53 Chapter 1. Linear Regression
From one to two
Regularization
Basis functions
Bias-variance decomposition
Different regularization forms
Bayesian approach

54 Bayesian linear regression
Define a (Gaussian) prior for the parameters. Note the likelihood function is Gaussian, so the posterior is also Gaussian (sketched below).
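A sketch of the standard Gaussian prior/likelihood/posterior equations (notation assumed, following the usual PRML-style formulation):

\[
p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_0,\mathbf{S}_0),
\qquad
p(\mathbf{t}\mid\mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}\big(t_n \mid \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n),\ \beta^{-1}\big),
\]
\[
p(\mathbf{w}\mid\mathbf{t}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_N,\mathbf{S}_N),
\quad
\mathbf{m}_N = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}\big),
\quad
\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}.
\]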

55 Chap 1. Linear Regression
MAP estimation: the maximum a posteriori (MAP) estimate, compared with the maximum likelihood estimate and with the ridge regression solution.

56 Chap 1. Linear Regression
How to set the prior: if using a zero-mean Gaussian prior, the MAP (Bayesian) estimate is equivalent to the ridge regression solution; if using a zero-mean Laplace prior, the MAP estimate (no closed-form expression) is equivalent to the LASSO regression solution. Conjugate prior: chosen so that the posterior and the prior follow the same family of distributions, e.g. Gaussian.

57 Chap 1. Linear Regression
Example (reproduced from PRML): 0 data points observed. Parameters for simulation:

58 Chap 1. Linear Regression
Simulation results 1/3: 1 data point observed. Panels: likelihood, posterior, data space.

59 Chap 1. Linear Regression
Simulation results 2/3: 2 data points observed. Panels: likelihood, posterior, data space.

60 Chap 1. Linear Regression
Simulation results 3/3: 20 data points observed. Panels: posterior, data space. The variance of the posterior decreases as the number of data points increases.

61 Predictive distribution
In the Bayesian framework, every variable has a distribution, including the predicted output given a new input. The predictive variance contains a term that vanishes as N increases (sketched below).
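A sketch of the standard predictive distribution for Bayesian linear regression (notation assumed, using the posterior m_N, S_N above):

\[
p(t\mid x,\mathbf{t}) = \int p(t\mid x,\mathbf{w})\,p(\mathbf{w}\mid\mathbf{t})\,d\mathbf{w}
 = \mathcal{N}\big(t \mid \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(x),\ \sigma_N^2(x)\big),
\qquad
\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^{\mathsf T}\mathbf{S}_N\boldsymbol{\phi}(x),
\]
where the first term is the observation noise and the second term vanishes as N increases.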

62 Chap 1. Linear Regression
Example (reproduced from PRML): sinusoidal data, 9 Gaussian basis functions, 1 data point. Shown: the true function, the predictive mean, the predictive variance, and different predicted functions.

63 Chap 1. Linear Regression
Simulation results 1/3: sinusoidal data, 9 Gaussian basis functions, 2 data points.

64 Chap 1. Linear Regression
Simulation results 2/3: sinusoidal data, 9 Gaussian basis functions, 4 data points.

65 Chap 1. Linear Regression
Simulation results 3/3: sinusoidal data, 9 Gaussian basis functions, 25 data points.

66 Chap 1. Linear Regression
Model selection: in polynomial curve fitting, how do we set the order?

67 From the statistical perspective
Bayesian model selection: given the dataset, estimate the posterior probability of the different models. "ML" model selection: choose the model that maximizes the model evidence function (sketched below).
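A sketch of the standard model-posterior and model-evidence expressions (notation assumed):

\[
p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D}\mid\mathcal{M}_i)\,p(\mathcal{M}_i),
\qquad
p(\mathcal{D}\mid\mathcal{M}_i) = \int p(\mathcal{D}\mid\mathbf{w},\mathcal{M}_i)\,p(\mathbf{w}\mid\mathcal{M}_i)\,d\mathbf{w}.
\]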

68 Calculation of model evidence for Bayesian linear regression
Details: cf. PRML.

69 Chap 1. Linear Regression
Example (reproduced from PRML).

70 More about the hyper-parameters
We can "estimate" the hyper-parameters based on, e.g., the ML (evidence maximization) criterion. Define the eigenvalues of βΦᵀΦ as λ_i (the re-estimation equations are sketched below).
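A sketch of the re-estimation equations, following PRML's evidence approximation (notation assumed, with λ_i the eigenvalues of βΦᵀΦ and m_N the posterior mean):

\[
\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i},
\qquad
\alpha = \frac{\gamma}{\mathbf{m}_N^{\mathsf T}\mathbf{m}_N},
\qquad
\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\big(t_n - \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(x_n)\big)^2.
\]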

71 Chap 1. Linear Regression
Interpretation 1/3: the eigenvalues of βΦᵀΦ are the λ_i defined above (figure).

72 Chap 1. Linear Regression
Interpretation 2/3: by decreasing α, more parameters become "learnt from data"; γ measures the number of "learnt" parameters.

73 Chap 1. Linear Regression
Interpretation 3/3: recall that for estimating Gaussian parameters, the unbiased variance estimator divides by N − 1 rather than N; analogously, for Bayesian linear regression the noise-variance estimate divides by N − γ.

74 Chap 1. Linear Regression
Notes The hyper-parameters can be further regarded as random variables, and integrated into the Bayesian framework 2018/11/10 Chap 1. Linear Regression

75 Chap 1. Linear Regression
Chapter summary
Dictionary: bias-variance decomposition, equivalent kernel, Gaussian distribution, Laplace distribution, KKT condition, model selection, prior (conjugate prior), posterior, regularization, sparsity
Toolbox: basis functions, best subset selection, Lagrange multiplier, LASSO regression, least squares, MAP (Bayesian) estimation, ML estimation, ridge regression

