Lecture 4. Linear Models for Regression


Outline
- Linear regression: least squares solution
- Subset least squares: subset selection / forward / backward
- Penalized least squares: ridge regression, LASSO, elastic net (LASSO + ridge)

Linear Methods for Regression
- Input (features) vector, p-dimensional: X = (X1, X2, …, Xp)
- Real-valued output: Y
- Joint distribution of (Y, X); regression function E(Y | X) = f(X)
- Training data (x1, y1), (x2, y2), …, (xN, yN) for estimation of the input-output relation f

Linear Model
f(x) = β0 + β1 x1 + … + βp xp: the regression function or a good approximation to it; the model is LINEAR in the unknown parameters (weights, coefficients).

Features
- Quantitative inputs: any arbitrary but known function of the measured attributes
- Transformations of quantitative attributes g(x), e.g., log, square, square root, etc.
- Basis expansions: e.g., a polynomial approximation of f as a function of X1 (a Taylor series expansion with unknown coefficients)

Features (cont.)
Qualitative (categorical) input G. Dummy codes: for an attribute with k categories, we may use k indicator codes Xj, j = 1, 2, …, k, marking the category (level) used. Together, this collection of inputs represents the effect of G through Σj βj Xj, a set of level-dependent constants, since exactly one of the Xj equals one and the others are zero.

Features (cont.)
Interactions: second- or higher-order interactions of some features, e.g., a product X1X2. Feature vector for the i-th case in the training set (example).

Generalized Linear Models: Basis Expansion
A wide variety of flexible models: the model for f is a linear expansion of basis functions, f(x) = Σm θm hm(x). Dictionary: the prescribed basis functions hm.

Other Basis Functions
- Polynomial basis of degree s (smooth functions Cs)
- Fourier series (band-limited functions, a compact subspace of C∞)
- Splines: piecewise polynomials of degree K between the knots, joined with continuity of degree K−1 at the knots (Sobolev spaces)
- Wavelets (Besov spaces)
- Radial basis functions: symmetric p-dimensional kernels located at particular centroids, f(|x − y|), e.g., a Gaussian kernel at each centroid
- And more …
- Curse of dimensionality: p could be equal to or much larger than n.

Method of Least Squares
Find the coefficients that minimize the residual sum of squares, RSS(b) = Σi (yi − b0 − Σj xij bj)². RSS denotes the empirical risk over the training set; it does not by itself assure predictive performance over all inputs of interest.

Min RSS Criterion
- Statistically reasonable provided the examples in the training set are a large number of independent random draws from the input population for which prediction is desirable, and, given the inputs (x1, x2, …, xN), the outputs (y1, y2, …, yN) are conditionally independent.
- In principle, predictive performance over the set of future input vectors should be examined.
- Gaussian noise: the least squares method is equivalent to maximum likelihood.
- Minimize RSS(b) over b in R^(p+1), a quadratic function of b. Optimal solution: take the derivatives with respect to the elements of b and set them equal to zero.

Optimal Solution
The Hessian (2nd derivative) of the criterion function is given by XTX. The optimal solution satisfies the normal equations XT(Y − Xb) = 0, or (XTX) b = XTY. For a unique solution, the matrix XTX must be of full rank.
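A minimal R sketch of solving the normal equations, assuming X is an N x p numeric input matrix and y a length-N response vector (hypothetical names); lm gives the same fit:

X1 <- cbind(1, X)                              # add the intercept column
beta_hat <- solve(t(X1) %*% X1, t(X1) %*% y)   # solve the normal equations (XTX) b = XTY
H <- X1 %*% solve(t(X1) %*% X1) %*% t(X1)      # projection (hat) matrix
y_hat <- H %*% y                               # fitted values, identical to X1 %*% beta_hat
# the same fit with the built-in routine: coef(lm(y ~ X))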

Projection
When the matrix XTX is of full rank, the estimated response for the training set is Ŷ = X(XTX)⁻¹XTY = HY. H = X(XTX)⁻¹XT is the projection (hat) matrix, and HY is the orthogonal projection of Y onto the space spanned by the columns of X. Note: the projection is linear in Y.

Geometrical Insight

Simple Univariate Regression
One variable, no intercept. LS estimate: β̂ = ⟨x, y⟩ / ⟨x, x⟩; the inner product is related to the cosine of the angle between the vectors x and y, a measure of the similarity between y and x. Residuals r = y − β̂ x: the projection of y onto the space orthogonal to x.
Definition: "regress b on a" is the simple regression of response b on input a with no intercept, with estimate γ̂ = ⟨a, b⟩ / ⟨a, a⟩ and residual b − γ̂ a, "b adjusted for a" or "b orthogonalized with respect to a".
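A sketch of "regress b on a" in R, for hypothetical vectors a and b:

gamma_hat <- sum(a * b) / sum(a * a)   # <a, b> / <a, a>
r <- b - gamma_hat * a                 # residual: "b adjusted for a"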

Multiple Regression
Multiple regression: p > 1. The LS estimates differ from the simple univariate regression estimates unless the columns of the input matrix X are orthogonal: if ⟨xj, xk⟩ = 0 for j ≠ k, then β̂j = ⟨xj, y⟩ / ⟨xj, xj⟩, and these estimates are uncorrelated. Orthogonal inputs occur sometimes in balanced, designed experiments (experimental design); observational studies will almost never have orthogonal inputs, so we must "orthogonalize" them in order to have a similar interpretation. Use the Gram-Schmidt procedure to obtain an orthogonal basis for multiple regression.

Multiple Regression Estimates: Sequence of Simple Regressions
Regression by successive orthogonalization:
1. Initialize z0 = x0 = 1.
2. For j = 1, 2, …, p, regress xj on z0, z1, …, z(j−1) to produce the coefficients γ̂kj and the residual vector zj = xj − Σk<j γ̂kj zk.
3. Regress y on the residual vector zp for the estimate β̂p.
In the two-input case: instead of using x1 and x2, take x1 and z (x2 orthogonalized with respect to x1) as features.

Multiple Regression = Gram-Schmidt Orthogonalization Procedure
The vector zp is the residual of the multiple regression of xp on all the other inputs. The successive z's in the above algorithm are orthogonal and form an orthogonal basis for the column space of X; the least squares projection onto this subspace is the usual fitted vector Ŷ. By re-arranging the order of the variables, any input can be labeled as the p-th variable. If xp is highly correlated with the other variables, the residual zp is quite small and the coefficient β̂p has high variance.
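A minimal R sketch of regression by successive orthogonalization, assuming the matrix X and vector y as before; it recovers the multiple-regression coefficient of the last input:

regress_out <- function(v, Z) {        # residual of v after simple regressions on the columns of Z
  for (k in seq_len(ncol(Z)))
    v <- v - (sum(Z[, k] * v) / sum(Z[, k]^2)) * Z[, k]
  v
}
Z <- matrix(1, nrow(X), 1)             # z0: the intercept column
for (j in 1:ncol(X)) Z <- cbind(Z, regress_out(X[, j], Z))
zp <- Z[, ncol(Z)]                     # residual of x_p on all earlier z's
beta_p <- sum(zp * y) / sum(zp^2)      # regress y on z_p: the coefficient of x_p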

Statistical Properties of the LS Model
Uncorrelated noise with mean zero and variance σ²: then Var(β̂) = (XTX)⁻¹ σ². Noise estimation: σ̂² = RSS / (N − p − 1), with model d.f. = p + 1 (the dimension of the model space). To draw inferences on the parameter estimates, we need distributional assumptions on the noise: if we assume ε ~ N(0, σ²), then β̂ ~ N(β, (XTX)⁻¹ σ²) and (N − p − 1) σ̂² ~ σ² χ²(N − p − 1).

Gauss-Markov Theorem
If cTy is any linear estimator that is unbiased for aTβ, that is, E(cTy) = aTβ, then Var(aTβ̂) ≤ Var(cTy). It says that, for inputs in the row space of X, the LS estimate has minimum variance among all linear unbiased estimates.

Bias-Variance Tradeoff
- Mean squared error of an estimator = variance + squared bias.
- The least squares estimator achieves the minimal variance among all linear unbiased estimators.
- There are biased estimators that further reduce the variance: Stein's estimator, shrinkage/thresholding (LASSO, etc.).
- The more complicated a model is, the more variance but the less bias; a trade-off is needed.

Hypothesis Test
Single-parameter test of βj = 0: t-statistic zj = β̂j / (σ̂ √vj), where vj is the j-th diagonal element of V = (XTX)⁻¹. Confidence interval: β̂j ± z(1−α/2) σ̂ √vj, e.g., z(1−0.025) = 1.96. Group of parameters: F-statistic F = [(RSS0 − RSS1)/(p1 − p0)] / [RSS1/(N − p1 − 1)] for nested models.

Example R command: lm(Y ~ x1 + x2 + … +xp)
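A sketch expanding the R command above; summary and anova report the t- and F-statistics of the hypothesis-test slide. The data frame dat and the reduced model are hypothetical:

fit  <- lm(y ~ ., data = dat)     # full model on the data frame dat
summary(fit)                      # t-statistics and p-values for each test beta_j = 0
confint(fit, level = 0.95)        # 95% confidence intervals for the coefficients
fit0 <- lm(y ~ x1, data = dat)    # hypothetical reduced (nested) model
anova(fit0, fit)                  # F-test comparing the nested models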

Rank Deficiency
X rank deficient: the normal equations have infinitely many solutions. The hat matrix H and the projection Ŷ = HY are nevertheless unique. For an input in the row space of X, the LS estimate is unique; for an input not in the row space of X, the estimate may change with the solution used. How to generalize to inputs outside the training set? Penalized methods (!)

Reasons for Alternatives to LS Estimates
- Prediction accuracy: LS estimates have low bias but high variance when the inputs are highly correlated, giving a larger expected squared prediction error (ESPE). Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero: a small bias in the estimates may yield a large decrease in variance, so the bias/variance tradeoff may provide better predictive ability.
- Better interpretation: with a large number of input variables, we would like to determine a smaller subset that exhibits the strongest effects.
- Many tools achieve these objectives: subset selection, and penalized regression (constrained optimization).

Best Subset Selection Method
- Algorithm: leaps and bounds.
- Find the best subset (smallest RSS) for each size; for each fixed size k, can also find a specified number of subsets close to the best.
- For each fixed subset, obtain the LS estimates.
- Feasible for p up to about 40.
- The choice of the optimal k is based on model selection criteria to be discussed later.
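A sketch using the leaps package, whose regsubsets function performs the leaps-and-bounds search; the data frame dat and the size limit are assumptions:

library(leaps)
best <- regsubsets(y ~ ., data = dat, nvmax = 8)   # best subset of each size up to 8 (nvmax <= p)
summary(best)$which                                # which inputs enter the best model of each size
coef(best, 3)                                      # LS estimates for the best subset of size 3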

Other Subset Selection Procedures
- For larger p, the classical procedures are forward selection (step-up), backward elimination (step-down), and hybrid forward-backward (stepwise) methods.
- Given a model, these methods only provide local controls for variable selection or deletion: which current variable is least effective (a candidate for deletion), and which variable not in the model is most effective (a candidate for inclusion).
- They do not attempt to find the best subset of a given size.
- Not too popular in current practice.

Forward Stagewise Selection (Incremental)
Forward stagewise: standardize the input variables; start with all coefficients at zero and residual r = y; at each step, find the input most correlated with the current residual, move its coefficient by a small amount ε in the direction of that correlation, and update the residual.
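A minimal sketch of the incremental procedure, assuming the columns of X are standardized and y is centered; the step size eps and the number of steps are arbitrary choices:

eps  <- 0.01                                 # small step size
beta <- rep(0, ncol(X)); r <- y              # start at zero, residual = y
for (step in 1:2000) {                       # number of steps is arbitrary here
  cors <- drop(t(X) %*% r)                   # (scaled) correlation of each input with the residual
  j <- which.max(abs(cors))                  # most correlated input
  beta[j] <- beta[j] + eps * sign(cors[j])   # small move in that direction
  r <- r - eps * sign(cors[j]) * X[, j]      # update the residual
}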

Penalized Regression
Instead of directly minimizing the residual sum of squares, penalized regression usually takes the form min over f of Σi (yi − f(xi))² + λ J(f), where J(f) is the penalization term, usually penalizing the smoothness or complexity of the function f, and λ is chosen by cross-validation.

Model Assessment and Selection
If we are in a data-rich situation, split the data into three parts: training, validation, and test sets. See Chapter 7.1 for details.

Cross Validation
When the sample size is not sufficiently large, cross validation is a way to estimate the out-of-sample prediction error (or classification rate). Randomly split the available data into training and test sets to obtain error1; split many times to get error2, …, errorm, then average over all the errors to get an estimate.
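A minimal K-fold cross-validation sketch in R, assuming a data frame dat with response column y:

K <- 10
fold <- sample(rep(1:K, length.out = nrow(dat)))    # random fold labels
cv_err <- sapply(1:K, function(k) {
  fit  <- lm(y ~ ., data = dat[fold != k, ])        # train on K - 1 folds
  pred <- predict(fit, newdata = dat[fold == k, ])  # predict the held-out fold
  mean((dat$y[fold == k] - pred)^2)                 # error_k for this split
})
mean(cv_err)                                        # cross-validation estimate of the prediction error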

Ridge Regression (Tikhonov Regularization)
Ridge regression shrinks the coefficients by imposing a penalty on their size: minimize the penalized RSS, Σi (yi − β0 − Σj xij βj)² + λ Σj βj². Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger its value, the greater the amount of shrinkage, and the coefficients are shrunk towards zero. The choice of the penalty parameter is based on cross validation.

Ridge Regression (cont.)
Equivalent problem: minimize the RSS subject to Σj βj² ≤ s (a Lagrange multiplier formulation); there is a 1-1 correspondence between s and λ. With many correlated variables, LS estimates can become unstable and exhibit high variance and high correlations: a wildly large positive coefficient on one variable can be cancelled by a large negative coefficient on another. Imposing a size constraint on the coefficients prevents this phenomenon from occurring. Ridge solutions are not invariant under scaling of the inputs, so we normally standardize the inputs before solving the optimization problem. Since the penalty term does not include the bias term, the intercept is estimated by the mean of the response y.

Ridge Regression (cont.)
The ridge criterion in matrix form gives the solution β̂_ridge = (XTX + λI)⁻¹XTY. Shrinkage: for orthogonal inputs, the ridge estimates are a scaled version of the LS estimates, β̂_ridge = β̂ / (1 + λ). Ridge is the mean or mode of the posterior distribution of β under a normal prior. For the centered input matrix X, take the SVD X = UDVT: U and V are orthogonal matrices, the columns of U span the column space of X, the columns of V span the row space of X, and D is a diagonal matrix of singular values. The eigen decomposition of XTX is VD²VT; the eigenvectors vj are the principal component directions of X (Karhunen-Loeve directions).

Ridge Regression and Principal Components
First PC direction v1: among all normalized linear combinations of the columns of X, the derived variable z1 = Xv1 has the largest sample variance; z1 is the first principal component of X. Subsequent PCs have maximum variance subject to being orthogonal to the earlier ones; the last PC has minimum variance, and ridge shrinks these low-variance directions the most. Effective degrees of freedom: df(λ) = Σj dj² / (dj² + λ).

Ridge Regression (Summary)
- Ridge regression penalizes the complexity of a linear model by the sum of squares of the coefficients.
- It is equivalent to minimizing the RSS subject to the constraint Σj βj² ≤ s.
- The matrix (XTX + λI) is always invertible for λ > 0.
- The penalization parameter λ controls how simple "you" want the model to be.
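A sketch of the ridge fit, either through the closed form (XTX + λI)⁻¹XTY above or with the glmnet package, where alpha = 0 selects the ridge penalty; X, y, and the value of lambda are assumptions:

Xs <- scale(X); yc <- y - mean(y)      # standardize the inputs, center the response
lambda <- 1                            # assumed value; choose by cross validation in practice
beta_ridge <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)
# with glmnet: library(glmnet); cv.glmnet(Xs, yc, alpha = 0)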

Prostate Cancer Example

Ridge Regression (Summary) Prostate Cancer Example
- Solutions are not sparse in the coefficient space: the βj's are almost never exactly zero.
- The computational complexity is O(p³) when inverting the matrix XTX + λI.

Least Absolute Shrinkage and Selection Operator (LASSO)
Penalized RSS with an L1-norm penalty, λ Σj |βj|, or equivalently subject to the constraint Σj |βj| ≤ s. It shrinks like ridge with its L2-norm penalty, but the LASSO coefficients hit exactly zero as the penalty increases.

LASSO as Penalized Regression
Instead of directly minimizing the residual sum of squares, the penalized regression takes the form min over β of Σi (yi − β0 − Σj xij βj)² + λ J(β), where for the LASSO J(β) = Σj |βj|.

LASSO (cont.)
The computation is a quadratic programming problem; we can obtain the solution path, which is piecewise linear. The coefficients are non-linear in the response y (they are linear in y in ridge regression). The regularization parameter is chosen by cross validation.
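A sketch using the glmnet package (alpha = 1 gives the L1 penalty); the package computes the piecewise-linear path and the cross-validated lambda:

library(glmnet)
fit <- glmnet(X, y, alpha = 1)     # whole piecewise-linear solution path
plot(fit, xvar = "lambda")         # coefficient profiles versus log(lambda)
cv  <- cv.glmnet(X, y, alpha = 1)  # lambda chosen by cross validation
coef(fit, s = cv$lambda.min)       # sparse coefficient vector at the selected lambda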

LASSO and Ridge
Contours of the RSS in the space of the βj's, together with the L1 (diamond) and L2 (disk) constraint regions.

Generalize to the L-q Norm as Penalty
Minimize the RSS subject to a constraint on the l-q norm, Σj |βj|^q ≤ s; equivalently, minimize RSS + λ Σj |βj|^q. This is bridge regression, with ridge (q = 2) and LASSO (q = 1) as special cases; q = 1 is the smallest value giving a convex region. For q = 0 we recover best subset regression; for 0 < q < 1 the problem is not convex!

Contours of constant values of L-q norms

Why non-convex norms? The LASSO is biased (it shrinks even the large coefficients): a non-convex penalty is necessary for an unbiased estimator.

Elastic Net as a compromise between Ridge and LASSO (Zou and Hastie 2005): the penalty combines the L1 and L2 terms, λ1 Σj |βj| + λ2 Σj βj².
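In the glmnet parameterization, alpha between 0 (ridge) and 1 (LASSO) mixes the two penalties; a sketch, with alpha = 0.5 an arbitrary choice:

library(glmnet)
cv_en <- cv.glmnet(X, y, alpha = 0.5)   # mix of the L1 and L2 penalties
coef(cv_en, s = "lambda.min")           # coefficients at the cross-validated lambda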

The Group LASSO
Group norm l1-l2 (also l1-l∞): the penalty sums the l2 (or l∞) norms of pre-specified groups of coefficients, so every group of variables is simultaneously selected or dropped.

Methods Using Derived Directions
- Principal Components Regression
- Partial Least Squares

Principal Components Regression
Motivation: the leading eigenvectors describe most of the variability in X. (Figure: the first two PC directions Z1, Z2 in the (X1, X2) plane.)

Principal Components Regression
- The Zi and Zj are now orthogonal.
- The dimension is reduced.
- High correlation between the independent variables is eliminated.
- Noise in the X's is (hopefully) removed.
- Computation: PCA + regression.
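A minimal PCR sketch, PCA followed by regression on the leading scores; the number of components M is an assumption, normally chosen by cross validation:

pc <- prcomp(X, center = TRUE, scale. = TRUE)   # principal components of the inputs
M  <- 3                                         # hypothetical number of components
Z  <- pc$x[, 1:M]                               # derived orthogonal inputs Z1, ..., ZM
fit_pcr <- lm(y ~ Z)                            # regression on the components
# alternatively: library(pls); pcr(y ~ X, ncomp = M, validation = "CV")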

Partial Least Squares (PLS)
- Uses the inputs as well as the response y to form the directions Zm.
- Seeks directions that have high variance and high correlation with the response y.
- Popular in chemometrics.
- If the original inputs are orthogonal, PLS finds the LS solution after one step; subsequent steps have no effect.
- Since the derived inputs use y, the estimates are non-linear functions of the response when the inputs are not orthogonal.
- The coefficients of the original variables tend to shrink as fewer PLS directions are used.
- The choice of M can be made via cross validation.

Partial Least Square Algorithm
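A sketch using the pls package, which implements the PLS algorithm; the number of directions and the reuse of X and y are assumptions:

library(pls)
fit_pls <- plsr(y ~ X, ncomp = 5, scale = TRUE, validation = "CV")   # up to 5 PLS directions
validationplot(fit_pls)    # cross-validated RMSEP versus the number of directions, to choose M
coef(fit_pls, ncomp = 3)   # coefficients on the original inputs using 3 directions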

PCR vs. PLS
Principal components regression chooses directions α that maximize Var(Xα); partial least squares chooses its m-th direction to maximize Corr²(y, Xα)·Var(Xα), subject to orthogonality with the earlier directions. The variance term tends to dominate, whence PLS is close to ridge.

Ridge, PCR and PLS
The solution paths of the different methods in a two-variable regression case (corr(X1, X2) = ρ, β = (4, 2)).

Comparisons of Estimated Prediction Errors (Prostate Cancer Example)
Least squares: test error 0.586, sd of the error 0.184. The competing methods' test errors (sd): 0.574 (0.156), 0.540 (0.168), 0.636 (0.172), 0.491 (0.152), 0.527 (0.122).

LASSO and Forward Stagewise

Diabetes Data

LASSO and Forward Stagewise

Least Angle Regression (LARS) Efron, Hastie, Johnstone, and Tibshirani (2003)

Recall: Forward Stagewise Selection (Incremental)
Forward stagewise: standardize the input variables; repeatedly find the input most correlated with the current residual, move its coefficient by a small amount ε in that direction, and update the residual.

LAR directions and example

Relationship Between the Three
Lasso and forward stagewise can be thought of as restricted versions of LARS.
- For Lasso: start with LARS; if a coefficient crosses zero, stop, drop that predictor, recompute the best direction, and continue. This gives the LASSO path.
- For stagewise: start with LARS; select the most correlated direction at each stage and move in that direction with a small step.
- There are other related methods: orthogonal matching pursuit, linearized Bregman iteration.
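All three paths can be computed with the lars package; a sketch, assuming X is a matrix of standardized inputs and y the response:

library(lars)
fit_lasso <- lars(X, y, type = "lasso")              # LARS with the lasso modification
fit_lar   <- lars(X, y, type = "lar")                # plain least angle regression
fit_stage <- lars(X, y, type = "forward.stagewise")  # infinitesimal forward stagewise
plot(fit_lasso)                                      # piecewise-linear coefficient paths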

Homework Project I Keyword Pricing (regression)

Homework Project II
Click Prediction (classification): two subproblems, click/impression and click/bidding.
Data directory: /data/ipinyou/
Files:
- bid.20130301.txt: bidding log file, 1.2M rows, 470MB
- imp.20130301.txt: impression log, 0.8M rows, 360MB
- clk.20130301.txt: click log file, 796 rows, 330KB
- data.zip: the compressed files above (password: ipinyou2013)
- dsp_bidding_data_format.pdf: data format description
- Region&citys.txt: region and city codes
Questions: dsp-competition@ipinyou.com

Homework Project II: Data Input in R
bid <- read.table("/Users/Liaohairen/DSP/bid.20130301.txt", sep='\t', comment.char='')
imp <- read.table("/Users/Liaohairen/DSP/imp.20130301.txt", sep='\t', comment.char='')
R's read.table by default uses '#' as the comment character (i.e., it has the parameter comment.char = '#'), but the user-agent field may contain a '#' character. To read the files correctly, turn off the interpretation of comments by setting comment.char=''.

Homework Project III
Heart Operation Effect Prediction (classification). Note: a large amount of missing values.