# Lecture 4. Linear Models for Regression

Outline
- Linear regression: least squares solution
- Subset least squares: subset selection, forward/backward
- Penalized least squares: ridge regression, LASSO, elastic net (LASSO + ridge)

Linear Methods for Regression
- Input (feature) vector, $p$-dimensional: $X = (X_1, X_2, \dots, X_p)$
- Real-valued output: $Y$
- Joint distribution of $(Y, X)$
- Regression function: $E(Y \mid X) = f(X)$
- Training data: $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$, used to estimate the input-output relation $f$

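Linear Model
- $f(x) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j$ is the regression function, or a good approximation to it.
- The model is linear in the unknown parameters (weights, coefficients) $\beta_0, \beta_1, \dots, \beta_p$.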

Features
- Quantitative inputs.
- Any arbitrary but known function of the measured attributes.
- Transformations of quantitative attributes: $g(x)$, e.g., log, square, square root.
- Basis expansions: e.g., a polynomial approximation of $f$ as a function of $X_1$ (a Taylor series expansion with unknown coefficients).

Features (cont.)
- Qualitative (categorical) input $G$.
- Dummy codes: for an attribute with $k$ categories, use $k$ indicator inputs $X_j = I(G = j)$, $j = 1, 2, \dots, k$, one per category (level).
- Together, this collection of inputs represents the effect of $G$ through $\sum_{j=1}^{k} X_j \beta_j$. This is a set of level-dependent constants, since only one of the $X_j$ equals one and the others are zero.

Features (cont.)
- Interactions: second- or higher-order interactions of some features, e.g., the product $X_j X_k$ of two features.
- Feature vector for the $i$th case in the training set (example): $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$.

Generalized Linear Models: Basis Expansion
- A wide variety of flexible models.
- The model for $f$ is a linear expansion of basis functions: $f(x) = \sum_{m=1}^{M} \beta_m h_m(x)$.
- Dictionary: the prescribed basis functions $h_m$.

Other Basis Functions
- Polynomial basis of degree $s$ (smooth functions, $C^s$).
- Fourier series (band-limited functions, a compact subspace of $C^\infty$).
- Splines: piecewise polynomials of degree $K$ between the knots, joined with continuous derivatives up to order $K-1$ at the knots (Sobolev spaces).
- Wavelets (Besov spaces).
- Radial basis functions: symmetric $p$-dimensional kernels located at particular centroids, $f(|x - y|)$, e.g., a Gaussian kernel at each centroid.
- And more.
- Curse of dimensionality: $p$ could be equal to or much larger than $n$.

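Method of Least Squares
- Find the coefficients $b = (b_0, b_1, \dots, b_p)$ that minimize the residual sum of squares, $\mathrm{RSS}(b) = \sum_{i=1}^{N} \big(y_i - b_0 - \sum_{j=1}^{p} x_{ij} b_j\big)^2 = (Y - Xb)^T (Y - Xb)$, where $X$ is the $N \times (p+1)$ input matrix whose first column is all ones.
- RSS is the empirical risk over the training set. It does not guarantee predictive performance over all inputs of interest.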

Min RSS Criterion
- Statistically reasonable provided the examples in the training set are a large number of independent random draws from the population of inputs for which prediction is desired.
- Given the inputs $(x_1, x_2, \dots, x_N)$, the outputs $(y_1, y_2, \dots, y_N)$ are conditionally independent.
- In principle, predictive performance over the set of future input vectors should be examined.
- Gaussian noise: the least squares method is equivalent to maximum likelihood.
- Minimize $\mathrm{RSS}(b)$ over $b \in \mathbb{R}^{p+1}$; it is a quadratic function of $b$.
- Optimal solution: take the derivatives with respect to the elements of $b$ and set them equal to zero.

Optimal Solution
- The Hessian (second derivative) of the criterion function is $2X^TX$.
- The optimal solution satisfies the normal equations $X^T(Y - Xb) = 0$, or $(X^TX)\,b = X^TY$.
- For a unique solution, the matrix $X^TX$ must have full rank, in which case $\hat b = (X^TX)^{-1}X^TY$.
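The normal equations can be solved directly on a small example. The following is a minimal sketch in plain Python, not a numerically robust implementation: the helper names `solve_linear_system` and `least_squares` and the toy data are assumptions made for illustration, and an intercept column of ones is prepended to the inputs.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: X is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Return b solving (X^T X) b = X^T Y, with an intercept column prepended."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(xi[r] * xi[c] for xi in X) for c in range(k)] for r in range(k)]
    xty = [sum(xi[r] * yi for xi, yi in zip(X, y)) for r in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 - 3*x2 (hypothetical example).
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in rows]
b_hat = least_squares(rows, y)
print(b_hat)
```

Because the toy responses contain no noise, the recovered coefficients should match the generating values up to rounding. When $X$ is rank deficient the solver raises an error instead of returning one of the infinitely many solutions.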

Projection
- When the matrix $X^TX$ has full rank, the estimated response for the training set is $\hat Y = X\hat b = X(X^TX)^{-1}X^TY = HY$.
- $H = X(X^TX)^{-1}X^T$: the projection ("hat") matrix.
- $HY$: the orthogonal projection of $Y$ onto the space spanned by the columns of $X$.
- Note: the projection is linear in $Y$.
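A short check, using only the definition $H = X(X^TX)^{-1}X^T$ (valid when $X^TX$ has full rank), that $H$ behaves like an orthogonal projection: it is symmetric and idempotent, and the residual is orthogonal to the columns of $X$.

$$
H^T = H, \qquad
H^2 = X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = H, \qquad
X^T(Y - HY) = X^TY - X^TX(X^TX)^{-1}X^TY = 0 .
$$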

Geometrical Insight

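Simple Univariate Regression
- One variable with no intercept: $Y = X\beta + \varepsilon$.
- LS estimate: $\hat\beta = \langle x, y\rangle / \langle x, x\rangle$.
- For unit-length vectors, the inner product $\langle x, y\rangle$ is the cosine of the angle between $x$ and $y$, a measure of similarity between $y$ and $x$.
- Residuals $r = y - x\hat\beta$: the projection of $y$ onto the orthogonal complement of $x$.
- Definition, "regress $b$ on $a$": the simple regression of response $b$ on input $a$, with no intercept.
  - Estimate: $\hat\gamma = \langle a, b\rangle / \langle a, a\rangle$.
  - Residual: $b - \hat\gamma a$, also called "$b$ adjusted for $a$" or "$b$ orthogonalized with respect to $a$".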

Multiple Regression: $p > 1$
- The LS estimates differ from the simple univariate regression estimates unless the columns of the input matrix $X$ are orthogonal.
- If $\langle x_j, x_k\rangle = 0$ for all $j \neq k$, then $\hat\beta_j = \langle x_j, y\rangle / \langle x_j, x_j\rangle$. These estimates are uncorrelated, and $\mathrm{Var}(\hat\beta_j) = \sigma^2 / \langle x_j, x_j\rangle$, where $\sigma^2$ is the noise variance.
- Orthogonal inputs occur sometimes in balanced, designed experiments (experimental design).
- Observational studies will almost never have orthogonal inputs. We must "orthogonalize" them in order to have a similar interpretation.
- Use the Gram-Schmidt procedure to obtain an orthogonal basis for multiple regression.

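Multiple Regression Estimates: Sequence of Simple Regressions
Regression by successive orthogonalization:
1. Initialize $z_0 = x_0 = 1$.
2. For $j = 1, 2, \dots, p$: regress $x_j$ on $z_0, z_1, \dots, z_{j-1}$ to produce the coefficients $\hat\gamma_{\ell j} = \langle z_\ell, x_j\rangle / \langle z_\ell, z_\ell\rangle$, $\ell = 0, \dots, j-1$, and the residual vector $z_j = x_j - \sum_{k=0}^{j-1} \hat\gamma_{kj} z_k$.
3. Regress $y$ on the residual vector $z_p$ to get the estimate $\hat\beta_p = \langle z_p, y\rangle / \langle z_p, z_p\rangle$.

Example with two inputs: instead of using $x_1$ and $x_2$, take $x_1$ and $z$ as features, where $z$ is the residual of $x_2$ regressed on $x_1$.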
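Successive orthogonalization can be traced on a tiny data set. This is a sketch in plain Python under assumed names (`dot`, `successive_orthogonalization`) and made-up data; it follows the regress-then-take-residual steps and returns only the coefficient of the last input, which is what the procedure delivers.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def successive_orthogonalization(columns, y):
    """Return (z, beta_p): the orthogonal residual vectors and the last coefficient.

    `columns` holds x_1..x_p as lists; the constant column x_0 = 1 is added here.
    """
    n = len(y)
    z = [[1.0] * n]  # z_0 = x_0 = 1
    for x in columns:
        residual = list(x)
        for zl in z:
            gamma = dot(zl, x) / dot(zl, zl)  # coefficient of x_j on z_l
            residual = [r - gamma * v for r, v in zip(residual, zl)]
        z.append(residual)  # z_j: the part of x_j orthogonal to z_0..z_{j-1}
    beta_p = dot(z[-1], y) / dot(z[-1], z[-1])  # regress y on z_p
    return z, beta_p


# Hypothetical data generated exactly from y = 1 + 2*x1 - 3*x2.
x1 = [0.0, 1.0, 0.0, 1.0, 2.0]
x2 = [0.0, 0.0, 1.0, 1.0, 1.0]
y = [1.0 + 2.0 * a - 3.0 * b for a, b in zip(x1, x2)]
z, beta_p = successive_orthogonalization([x1, x2], y)
print(beta_p)
```

Reordering `columns` so that a different input comes last yields that input's multiple regression coefficient instead.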

Multiple Regression = Gram-Schmidt Orthogonalization Procedure
- The vector $z_p$ is the residual of the multiple regression of $x_p$ on all the other inputs.
- Successive $z$'s in the above algorithm are orthogonal and form an orthogonal basis for the column space of $X$.
- The least squares projection onto this subspace is the usual $\hat y$.
- By rearranging the order of the variables, any input can be labeled as the $p$th variable.
- If $x_p$ is highly correlated with the other variables, the residual $z_p$ is quite small, and the coefficient $\hat\beta_p$ has high variance, since $\mathrm{Var}(\hat\beta_p) = \sigma^2 / \|z_p\|^2$.

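Statistical Properties of LS
- Model: $Y = X\beta + \varepsilon$.
- Uncorrelated noise: mean zero, variance $\sigma^2$.
- Then $E(\hat\beta) = \beta$ and $\mathrm{Var}(\hat\beta) = (X^TX)^{-1}\sigma^2$.
- Noise estimation: $\hat\sigma^2 = \frac{1}{N-p-1}\sum_{i=1}^{N} (y_i - \hat y_i)^2$.
- Model d.f. = $p + 1$ (dimension of the model space).
- To draw inferences on the parameter estimates, we need assumptions on the noise. If we assume $\varepsilon \sim N(0, \sigma^2 I)$, then $\hat\beta \sim N(\beta, (X^TX)^{-1}\sigma^2)$ and $(N-p-1)\,\hat\sigma^2 \sim \sigma^2 \chi^2_{N-p-1}$.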

Gauss-Markov Theorem
- If we have any linear estimator $c^Ty$ that is unbiased for $a^T\beta$, that is, $E(c^Ty) = a^T\beta$, then $\mathrm{Var}(a^T\hat\beta) \le \mathrm{Var}(c^Ty)$.
- It says that, for inputs in the row space of $X$, the LS estimate has minimum variance among all linear unbiased estimates.

Mean squared error of an estimator = variance + bias$^2$
- The least squares estimator achieves the minimal variance among all linear unbiased estimators.
- There are biased estimators that further reduce variance: Stein's estimator, shrinkage/thresholding (LASSO, etc.).
- The more complicated a model is, the more variance but the less bias it has, so a trade-off is needed.
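The decomposition can be written out for an estimator $\tilde\theta$ of a scalar parameter $\theta$ (symbols chosen here for illustration). Note that the bias enters squared.

$$
\mathrm{MSE}(\tilde\theta) = E(\tilde\theta - \theta)^2 = \mathrm{Var}(\tilde\theta) + \big[E(\tilde\theta) - \theta\big]^2 .
$$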

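Hypothesis Test
- Single-parameter test of $\beta_j = 0$: the t-statistic $z_j = \hat\beta_j / (\hat\sigma \sqrt{v_j})$, where $\hat\sigma$ is the estimated noise standard deviation and $v_j$ is the $j$th diagonal element of $V = (X^TX)^{-1}$.
- Confidence interval: $\hat\beta_j \pm z\,\hat\sigma\sqrt{v_j}$, e.g., $z = 1.96$ for an approximate 95% interval.
- Group of parameters: the F-statistic for nested models, $F = \dfrac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$, where $\mathrm{RSS}_1$ belongs to the bigger model with $p_1 + 1$ parameters and $\mathrm{RSS}_0$ to the nested smaller model with $p_0 + 1$ parameters.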

Example
R command: `lm(Y ~ x1 + x2 + … + xp)`

Rank Deficiency
- $X$ rank deficient: the normal equations have infinitely many solutions.
- The hat matrix $H$ and the projection $\hat Y = HY$ are still unique.
- For an input in the row space of $X$, the LS estimate is unique.
- For an input not in the row space of $X$, the estimate may change with the solution used.
- How to generalize to inputs outside the training set? Penalized methods (!)

Reasons for Alternatives to LS Estimates
- Prediction accuracy
  - LS estimates have low bias but high variance when inputs are highly correlated, giving a larger expected squared prediction error (ESPE).
  - Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero.
  - A small bias in the estimates may yield a large decrease in variance.
  - The bias/variance trade-off may provide better predictive ability.
- Better interpretation
  - With a large number of input variables, we would like to determine a smaller subset that exhibits the strongest effects.
- Many tools to achieve these objectives
  - Subset selection
  - Penalized regression (constrained optimization)

Best Subset Selection Method
- Algorithm: leaps and bounds.
- Find the best subset, corresponding to the smallest RSS, for each size.
- For each fixed size $k$, one can also find a specified number of subsets close to the best.
- For each fixed subset, obtain the LS estimates.
- Feasible for $p \sim 40$.
- The optimal $k$ is chosen with model selection criteria, to be discussed later.

Other Subset Selection Procedures
- For larger $p$, the classical methods:
  - Forward selection (step-up)
  - Backward elimination (step-down)
  - Hybrid forward-backward (stepwise) methods
- Given a model, these methods only provide local controls for variable selection or deletion:
  - Which current variable is least effective (candidate for deletion)?
  - Which variable not in the model is most effective (candidate for inclusion)?
- They do not attempt to find the best subset of a given size.
- Not too popular in current practice.

Forward Stagewise Selection
- (Incremental) forward stagewise.
- Standardize the input variables.
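Incremental forward stagewise repeatedly picks the input most correlated with the current residual and moves its coefficient by a small step in that direction. A minimal sketch of that idea in plain Python follows; the function name, the step size `eps`, the step limit `n_steps`, and the toy data are assumptions for illustration, and the inputs are taken to be already centered and scaled to unit length.

```python
def forward_stagewise(columns, y, eps=0.01, n_steps=2000):
    """Incremental forward stagewise on standardized columns; returns coefficients."""
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_steps):
        # For standardized inputs, inner products with the residual are
        # proportional to correlations.
        corr = [sum(a * r for a, r in zip(x, residual)) for x in columns]
        j = max(range(len(columns)), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-8:
            break  # residual is (numerically) uncorrelated with every input
        delta = eps if corr[j] > 0 else -eps
        beta[j] += delta
        residual = [r - delta * a for r, a in zip(residual, columns[j])]
    return beta


# Two orthonormal, centered inputs (hypothetical) and y = 3*x1 + 1*x2.
x1 = [-0.5, 0.5, -0.5, 0.5]
x2 = [-0.5, -0.5, 0.5, 0.5]
y = [3.0 * a + 1.0 * b for a, b in zip(x1, x2)]
beta = forward_stagewise([x1, x2], y)
print(beta)
```

With enough small steps the coefficients approach the least squares fit; stopping early gives a shrunken, partially fitted model.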

Penalized Regression
- Instead of directly minimizing the residual sum of squares, penalized regression usually takes the form $\min_f \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda J(f)$,
- where $J(f)$ is the penalization term, which usually penalizes the roughness or complexity of the function $f$.
- $\lambda$ is chosen by cross-validation.

Model Assessment and Selection
- If we are in a data-rich situation, split the data into three parts: training, validation, and testing.
- See chapter 7.1 for details.

Cross Validation
- When the sample size is not sufficiently large, cross-validation is a way to estimate the out-of-sample prediction error (or classification error rate).
- Randomly split the available data into a training part and a test part, fit on the training part, and record the test error $\mathrm{error}_1$.
- Split many times to get $\mathrm{error}_2, \dots, \mathrm{error}_m$, then average over all the errors to get an estimate.
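The repeated random split can be sketched with a simple straight-line model. This is an illustration in plain Python, not a general-purpose routine: `fit_line`, `random_split_cv`, the split fraction, the number of splits, and the data are all assumptions chosen for the example.

```python
import random


def fit_line(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def random_split_cv(xs, ys, n_splits=20, test_fraction=0.3, seed=0):
    """Average test mean squared error over repeated random train/test splits."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(1, int(round(test_fraction * n)))
    errors = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in test) / n_test
        errors.append(mse)  # error_1, error_2, ..., error_m
    return sum(errors) / len(errors)


xs = [float(i) for i in range(20)]
ys_exact = [2.0 + 0.5 * x for x in xs]  # no noise
ys_noisy = [y + (1.0 if i % 2 == 0 else -1.0) for i, y in enumerate(ys_exact)]
cv_exact = random_split_cv(xs, ys_exact)
cv_noisy = random_split_cv(xs, ys_noisy)
print(cv_exact, cv_noisy)
```

The same loop can compare candidate values of a tuning parameter such as $\lambda$: compute the averaged error for each candidate and keep the one with the smallest estimate.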

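Ridge Regression (Tikhonov Regularization)
- Ridge regression shrinks the coefficients by imposing a penalty on their size.
- Minimize a penalized RSS: $\hat\beta^{\text{ridge}} = \arg\min_\beta \big\{ \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \big\}$.
- Here $\lambda \ge 0$ is a complexity parameter that controls the amount of shrinkage: the larger its value, the greater the amount of shrinkage.
- Coefficients are shrunk toward zero.
- The penalty parameter is chosen by cross-validation.
- Example: prostate cancer data.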

Ridge Regression (cont)
- Equivalent problem: minimize RSS subject to $\sum_{j=1}^{p} \beta_j^2 \le s$.
- Lagrangian multiplier: there is a one-to-one correspondence between $s$ and $\lambda$.
- With many correlated variables, LS estimates can become unstable and exhibit high variance and high correlations. A wildly large positive coefficient on one variable can be cancelled by a large negative coefficient on another.
- Imposing a size constraint on the coefficients prevents this phenomenon from occurring.
- Ridge solutions are not invariant under scaling of the inputs, so normally the inputs are standardized before solving the optimization problem.
- Since the penalty term does not include the intercept, estimate the intercept by the mean of the response $y$.

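Ridge Regression (cont)
- The ridge criterion: $\mathrm{RSS}(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T\beta$, with solution $\hat\beta^{\text{ridge}} = (X^TX + \lambda I)^{-1} X^T y$.
- Shrinkage: for orthogonal inputs, the ridge estimates are scaled versions of the LS estimates; for orthonormal inputs, $\hat\beta^{\text{ridge}} = \hat\beta / (1 + \lambda)$.
- The ridge estimate is the mean (and mode) of the posterior distribution of $\beta$ under a normal prior.
- Centered input matrix $X$.
- SVD of $X$: $X = UDV^T$.
  - $U$ and $V$ have orthonormal columns.
  - The columns of $U$ span the column space of $X$.
  - The columns of $V$ span the row space of $X$.
  - $D$: a diagonal matrix of singular values $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$.
- Eigen decomposition: $X^TX = VD^2V^T$.
- The eigenvectors $v_j$ (columns of $V$): the principal component directions of $X$ (Karhunen-Loève directions).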
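With the SVD of the centered input matrix, the ridge fit can be written in terms of the singular values. The derivation below is a sketch that assumes $X = UDV^T$ with orthonormal columns $u_j$ of $U$, singular values $d_j$, and penalty parameter $\lambda$:

$$
X\hat\beta^{\text{ridge}}
= X(X^TX + \lambda I)^{-1}X^Ty
= UD(D^2 + \lambda I)^{-1}DU^Ty
= \sum_{j=1}^{p} u_j \,\frac{d_j^2}{d_j^2 + \lambda}\, u_j^T y .
$$

Each coordinate of $y$ along $u_j$ is shrunk by the factor $d_j^2/(d_j^2 + \lambda)$, so directions with small singular values, which are the low-variance principal components, are shrunk the most.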

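Ridge Regression and Principal Components
- First PC direction $v_1$: among all normalized linear combinations of the columns of $X$, $z_1 = Xv_1$ has the largest sample variance, $\mathrm{Var}(z_1) = d_1^2 / N$.
- The derived variable $z_1 = Xv_1 = u_1 d_1$ is the first principal component of $X$.
- Subsequent PCs have maximum variance subject to being orthogonal to the earlier ones. The last PC has minimum variance.
- Effective degrees of freedom: $\mathrm{df}(\lambda) = \mathrm{tr}\big[X(X^TX + \lambda I)^{-1}X^T\big] = \sum_{j=1}^{p} \dfrac{d_j^2}{d_j^2 + \lambda}$.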
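The effective degrees of freedom of a ridge fit depend only on the singular values $d_j$ of the centered inputs and on the penalty $\lambda$, through $\mathrm{df}(\lambda) = \sum_j d_j^2/(d_j^2 + \lambda)$. A small sketch in plain Python, with a hypothetical function name and made-up singular values:

```python
def effective_df(singular_values, lam):
    """Ridge effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lam)."""
    return sum(d * d / (d * d + lam) for d in singular_values)


d = [3.0, 2.0, 1.0]  # hypothetical singular values of a centered input matrix
lambdas = (0.0, 1.0, 10.0, 1000.0)
dfs = [effective_df(d, lam) for lam in lambdas]
for lam, df in zip(lambdas, dfs):
    print(lam, round(df, 4))
```

With no penalty the value equals the number of inputs, and it decreases toward zero as $\lambda$ grows.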

Ridge Regression (Summary)
- Ridge regression penalizes the complexity of a linear model by the sum of squares of the coefficients.
- It is equivalent to minimizing RSS subject to the constraint $\sum_{j} \beta_j^2 \le s$.
- The matrix $X^TX + \lambda I$ is always invertible for $\lambda > 0$.
- The penalization parameter $\lambda$ controls how simple "you" want the model to be.

Prostate Cancer Example

Ridge Regression (Summary)
- Solutions are not sparse in the coefficient space: the $\hat\beta_j$ are almost never exactly zero.
- The computational complexity is $O(p^3)$ when inverting the matrix $X^TX + \lambda I$.
- Example: prostate cancer data.

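Least Absolute Shrinkage and Selection Operator (LASSO)
- Penalized RSS with an $\ell_1$-norm penalty, $\lambda \sum_{j} |\beta_j|$, or equivalently RSS subject to the constraint $\sum_{j} |\beta_j| \le t$.
- Shrinks the coefficients like ridge (which uses an $\ell_2$-norm penalty), but LASSO coefficients hit exactly zero as the penalty increases.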

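LASSO as Penalized Regression
- Instead of directly minimizing the residual sum of squares, penalized regression usually takes the form $\min_f \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda J(f)$,
- where, for the LASSO, $f(x) = \beta_0 + \sum_{j=1}^{p} x_j\beta_j$ and $J(f) = \sum_{j=1}^{p} |\beta_j|$.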

LASSO (cont)
- The computation is a quadratic programming problem.
- We can obtain the whole solution path, which is piecewise linear.
- The coefficients are nonlinear in the response $y$ (they are linear in $y$ in ridge regression).
- The regularization parameter is chosen by cross-validation.
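For small problems the LASSO can also be computed by cyclic coordinate descent with soft thresholding. That is a different technique from the quadratic programming formulation mentioned above, shown here only because it fits in a few lines. The sketch assumes centered data with no intercept and the criterion $\sum_i (y_i - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_j |\beta_j|$; the function names and data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(columns, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - sum_j x_ij b_j)^2 + lam * sum_j |b_j| (no intercept)."""
    beta = [0.0] * len(columns)
    residual = list(y)  # residual = y - X beta, kept up to date
    for _ in range(n_sweeps):
        for j, x in enumerate(columns):
            norm_sq = sum(a * a for a in x)
            # Inner product of x_j with the partial residual that excludes x_j.
            rho = sum(a * r for a, r in zip(x, residual)) + norm_sq * beta[j]
            new_bj = soft_threshold(rho, lam / 2.0) / norm_sq
            step = new_bj - beta[j]
            residual = [r - step * a for r, a in zip(residual, x)]
            beta[j] = new_bj
    return beta


# Two orthonormal, centered inputs (hypothetical); least squares gives (3, 1).
x1 = [-0.5, 0.5, -0.5, 0.5]
x2 = [-0.5, -0.5, 0.5, 0.5]
y = [3.0 * a + 1.0 * b for a, b in zip(x1, x2)]
b0 = lasso_coordinate_descent([x1, x2], y, lam=0.0)
b4 = lasso_coordinate_descent([x1, x2], y, lam=4.0)
b10 = lasso_coordinate_descent([x1, x2], y, lam=10.0)
print(b0, b4, b10)
```

As the penalty grows, the smaller coefficient reaches exactly zero first and then the larger one follows, which is the selection behavior that ridge regression lacks.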

LASSO and Ridge
Contours of RSS in the space of the $\beta$'s.

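Generalize to the L-q norm as penalty
- Minimize RSS subject to a constraint on the $\ell_q$ norm, $\sum_{j=1}^{p} |\beta_j|^q \le s$.
- Equivalent to minimizing $\sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$.
- Bridge regression, with ridge ($q = 2$) and LASSO ($q = 1$, the smallest value of $q$ for which the constraint region is convex) as special cases.
- For $q = 0$: best subset regression.
- For $0 < q < 1$: the constraint region is not convex!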
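The loss of convexity for $q < 1$ can be seen from a single midpoint check: a convex penalty can never be larger at the midpoint of two points than the average of its values at those points. A small sketch in plain Python, where the function name and the chosen points are assumptions for illustration:

```python
def lq_penalty(beta, q):
    """Return sum_j |beta_j|^q (for q = 0, the number of nonzero coefficients)."""
    if q == 0:
        return sum(1 for b in beta if b != 0)
    return sum(abs(b) ** q for b in beta)


a, b = (1.0, 0.0), (0.0, 1.0)
mid = (0.5, 0.5)  # midpoint of a and b

# True when the midpoint inequality required by convexity holds for this pair.
midpoint_ok = {
    q: lq_penalty(mid, q) <= 0.5 * (lq_penalty(a, q) + lq_penalty(b, q)) + 1e-12
    for q in (0.5, 1.0, 2.0)
}
print(midpoint_ok)
```

A failed check proves non-convexity; a passed check on one pair of points is only consistent with convexity and does not prove it.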

Contours of constant values of L-q norms

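Why non-convex norms?
- The LASSO is biased: every nonzero coefficient is shrunk toward zero, no matter how large the true coefficient is.
- A nonconvex penalty is necessary for an unbiased estimator.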

Elastic Net as a compromise between Ridge and LASSO
(Zou and Hastie 2005)
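One way to write the compromise is the "naive" elastic net criterion, which adds both penalties to the residual sum of squares. The symbols $\lambda_1$ and $\lambda_2$ for the two penalty weights are assumptions here, and other parametrizations with a single $\lambda$ and a mixing weight are also in use:

$$
\hat\beta = \arg\min_{\beta} \; \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 .
$$

Setting $\lambda_2 = 0$ gives the LASSO and setting $\lambda_1 = 0$ gives ridge regression.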

The Group LASSO
- Group norm: $\ell_1$-$\ell_2$ (also $\ell_1$-$\ell_\infty$).
- The variables in each group are selected or dropped simultaneously.

Methods using Derived Directions
- Principal Components Regression
- Partial Least Squares

Principal Components Regression
- Motivation: the leading eigenvectors describe most of the variability in $X$.
- Figure: directions $Z_1$ and $Z_2$ in the $(X_1, X_2)$ plane.

Principal Components Regression
- $Z_i$ and $Z_j$ are now orthogonal.
- The dimension is reduced.
- High correlation between the independent variables is eliminated.
- Noise in the $X$'s is removed (hopefully).
- Computation: PCA + regression.

Partial Least Squares (PLS)
- Uses the inputs as well as the response $y$ to form the directions $Z_m$.
- Seeks directions that have high variance and high correlation with the response $y$.
- Popular in chemometrics.
- If the original inputs are orthogonal, it finds the LS fit after one step. Subsequent steps have no effect.
- Since the derived inputs use $y$, the estimates are nonlinear functions of the response when the inputs are not orthogonal.
- The coefficients for the original variables tend to shrink as fewer PLS directions are used.
- The choice of $M$ can be made via cross-validation.

Partial Least Square Algorithm

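PCR vs. PLS
- Principal component regression chooses the directions $v_m = \arg\max_{\alpha} \mathrm{Var}(X\alpha)$ subject to $\|\alpha\| = 1$ and $\alpha^T S v_\ell = 0$ for $\ell = 1, \dots, m-1$, where $S$ is the sample covariance matrix of the inputs.
- Partial least squares has $m$th direction $\hat\varphi_m = \arg\max_{\alpha} \mathrm{Corr}^2(y, X\alpha)\,\mathrm{Var}(X\alpha)$ subject to $\|\alpha\| = 1$ and $\alpha^T S \hat\varphi_\ell = 0$ for $\ell = 1, \dots, m-1$.
- The variance term tends to dominate, whence PLS is close to ridge.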

Ridge, PCR and PLS
The solution paths of the different methods in a two-variable regression case with $\mathrm{corr}(X_1, X_2) = \rho$ and $\beta = (4, 2)$.

Comparisons on Estimated Prediction Errors (Prostate Cancer Example)
Test error, with the standard deviation of the error in parentheses:
- Least squares: 0.586 (0.184)
- Other methods: 0.574 (0.156), 0.540 (0.168), 0.636 (0.172), 0.491 (0.152), 0.527 (0.122)

LASSO and Forward Stagewise

Diabetes Data

LASSO and Forward Stagewise

Least Angle Regression (LARS)
Efron, Hastie, Johnstone, and Tibshirani (2003)

Recall: Forward Stagewise Selection
- (Incremental) forward stagewise.
- Standardize the input variables.

LAR directions and Example

Relationship between those three
- LASSO and forward stagewise can be thought of as restricted versions of LARS.
- For LASSO: start with LARS; if a coefficient crosses zero, stop, drop that predictor, recompute the best direction, and continue. This gives the LASSO path.
- For stagewise: start with LARS; at each stage, select the most correlated direction and move in that direction with a small step.
- Other related methods: orthogonal matching pursuit, linearized Bregman iteration.

Homework Project I Keyword Pricing (regression)

Homework Project II Click Prediction (classification): two subproblems
- click/impression
- click/bidding

Data directory: /data/ipinyou/

Files:
- bid txt: bidding log file, 1.2M rows, 470MB
- imp txt: impression log, 0.8M rows, 360MB
- clk txt: click log file, 796 rows, 330KB
- data.zip: compressed files above (password: ipinyou2013)
- dsp_bidding_data_format.pdf: format file
- Region&citys.txt: region and city codes

Questions:

Homework Project II Data Input by R:

    bid <- read.table("/Users/Liaohairen/DSP/bid txt", sep='\t', comment.char='')
    imp <- read.table("/Users/Liaohairen/DSP/imp txt", sep='\t', comment.char='')

By default, R's `read.table` uses `#` as the comment character (that is, `comment.char = '#'`), but the user-agent field may contain a `#` character. To read the files correctly, turn off the interpretation of comments by setting `comment.char=''`.

Homework Project III Heart Operation Effect Prediction (classification)
Note: the data contain a large number of missing values.
