Penalized Regression, Part 3


Penalized Regression, Part 3 BMTRY 790: Machine Learning Summer 2017

Limitations of the Lasso
Many biological data sets we encounter have p > n, e.g., gene expression data and clinical data warehouse data. Variables in such data are also often correlated (grouped variables), and the lasso fails to do grouped selection. An ideal selection method should be able to remove all unimportant variables and automatically include whole groups in a model once one member of the group is included. It seems like some combination of ridge and lasso regression might work…

Elastic Net
Combines the ridge and lasso penalties in the objective function (Zou and Hastie, 2005). The L1 (lasso) part of the penalty generates a sparse model. The L2 (ridge) quadratic part of the penalty removes the limitation on the number of selected variables, encourages a grouping effect, and stabilizes the L1 regularization path.
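The objective function on this slide did not transfer; a standard statement of the elastic net criterion from Zou and Hastie (2005), with lambda_1 the lasso penalty parameter and lambda_2 the ridge penalty parameter, is

L(\lambda_1, \lambda_2, \beta) \;=\; \|y - X\beta\|_2^2 \;+\; \lambda_2 \|\beta\|_2^2 \;+\; \lambda_1 \|\beta\|_1 ,

minimized over \beta.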

Elastic Net Regularization
The elastic net penalty function has singularities at the vertices (sparsity) and strictly convex edges (which encourage grouping). Properties: it simultaneously does variable selection and continuous shrinkage, and it can select groups of correlated variables. The elastic net often outperforms the lasso in terms of prediction accuracy.

Naïve Elastic Net
Start from the naïve elastic net criterion; if we let α denote the relative weight of the ridge penalty, we can rewrite the penalty as a convex combination of the ridge and lasso terms (see the sketch below). The impact of α: α = 1 gives ridge regression, α = 0 gives lasso regression, and 0 < α < 1 gives the elastic net.
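The equations on this slide did not transfer; in Zou and Hastie's formulation, setting

\alpha \;=\; \frac{\lambda_2}{\lambda_1 + \lambda_2}

lets the elastic net penalty be written (up to a multiplicative constant) as the convex combination

\alpha \,\|\beta\|_2^2 \;+\; (1 - \alpha)\,\|\beta\|_1 ,

which recovers ridge regression at α = 1, the lasso at α = 0, and the elastic net for 0 < α < 1.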

Geometry of Elastic Net
The elastic net penalty has singularities at the vertices (necessary for sparsity) and strictly convex edges; the strength of the convexity varies with α (grouping).

Naïve Elastic Net Solution
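The solution formulas on this slide did not transfer; a sketch of the standard result (Zou and Hastie 2005, Lemma 1): define augmented data

X^* \;=\; (1+\lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, I_p \end{pmatrix},
\qquad
y^* \;=\; \begin{pmatrix} y \\ 0 \end{pmatrix} .

The naïve elastic net criterion then becomes a lasso problem in \beta^* = \sqrt{1+\lambda_2}\,\beta with penalty parameter \gamma = \lambda_1 / \sqrt{1+\lambda_2}, and the naïve estimate is

\hat{\beta}^{\text{naive}} \;=\; \frac{1}{\sqrt{1+\lambda_2}}\, \hat{\beta}^* ,

where \hat{\beta}^* is the lasso solution on the augmented data (y^*, X^*).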

Grouped Variables
Grouped variables occur when predictors within the X matrix are highly correlated, e.g., microarray data (genes in the same pathway are likely grouped) or environmental contaminant data from similar sources. Extreme case: Xi == Xj. Ideally a regression method identifies these variables as a group and assigns them similar coefficient values (assuming the variables are scaled and centered). Simulations have shown the lasso performs poorly when predictors in X are highly collinear. Ridge can handle collinearity BUT does not do variable selection.

Elastic Net For Grouping
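The result on this slide did not transfer; it is most likely the grouping-effect theorem of Zou and Hastie (2005). For centered y and standardized predictors x_i and x_j with sample correlation \rho, if the naïve elastic net estimates \hat{\beta}_i and \hat{\beta}_j have the same sign, then

\frac{1}{\|y\|_1} \left| \hat{\beta}_i - \hat{\beta}_j \right| \;\le\; \frac{1}{\lambda_2} \sqrt{2\,(1-\rho)} ,

so highly correlated predictors (\rho \approx 1) receive nearly identical coefficients.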

Corrected Elastic Net Solution
Assume we have data (y, X), penalty parameters (λ1, λ2), and the augmented data (y*, X*). Recall the naïve solution: because it is shrunk by both the ridge and the lasso penalties, Zou and Hastie apply a correction to avoid this "double" penalty.
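The correction itself did not transfer; as given by Zou and Hastie (2005), the corrected elastic net estimate rescales the naïve solution,

\hat{\beta}^{\text{enet}} \;=\; (1+\lambda_2)\, \hat{\beta}^{\text{naive}} ,

which undoes the extra shrinkage introduced by the ridge part of the penalty.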

Computing the Elastic Net Solution
Recall that the full lasso path could be calculated using a modified version of the least angle regression (LAR) algorithm, which sequentially adds the predictor most correlated with the residuals. LAR: once a variable is in the active set, it remains in the active set. Modified LAR (i.e., the lasso path): variables whose coefficients cross 0 are removed from the active set. The LARS-EN algorithm imposes an additional constraint: apply the modified LARS algorithm, but based on a fixed quadratic penalty term λ2.

LARS-EN Algorithm
(1) Initialize λ2 and the model, with an empty "active set" A0.
(2) Find the predictor most correlated with the residual r and update the active set to include it.
(3) Move that predictor's coefficient toward its least squares value until another covariate has the same correlation with r as the active predictor does; update the active set.
(4) Update r and move the active coefficients along their joint OLS direction until a third covariate is as correlated with r as the active covariates are; update the active set.
(4a) If a non-zero coefficient reaches 0, remove it from the active set and recalculate the current joint OLS direction.
(5) Continue until all p covariates have been added to the model.

Elastic Net vs. Lasso
Consider data with 2 "hidden" factors z1 and z2, with response y = z1 + 0.1 z2 + e, where e ~ N(0, 1/16). We observe predictors built from z1 and z2 rather than the hidden factors themselves, and fit a model on (X, y). Ideally, an "oracle" model identifies x1, x2, and x3 as important.
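The definition of the observed predictors did not transfer from the slide. The sketch below assumes the construction from Zou and Hastie's illustrative example (x1 through x3 are noisy copies of ±z1, x4 through x6 of ±z2, each with N(0, 1/16) noise), so treat those specifics as assumptions; it fits both paths with the lars and elasticnet packages used later in these slides.

# Hedged sketch: predictor construction assumed from Zou & Hastie (2005)
library(lars)        # lasso path
library(elasticnet)  # elastic net path

set.seed(790)
n  <- 100
z1 <- runif(n, 0, 20)
z2 <- runif(n, 0, 20)
y  <- z1 + 0.1 * z2 + rnorm(n, sd = 0.25)   # e ~ N(0, 1/16)

# x1-x3 are noisy copies of +/- z1; x4-x6 are noisy copies of +/- z2
X <- cbind(z1, -z1, z1, z2, -z2, z2) +
     matrix(rnorm(6 * n, sd = 0.25), n, 6)
colnames(X) <- paste0("x", 1:6)

fit.lasso <- lars(X, y, type = "lasso")     # lasso solution path
fit.enet  <- enet(X, y, lambda = 1)         # elastic net path with lambda2 = 1

par(mfrow = c(1, 2))
plot(fit.lasso)
title("Lasso Path")
plot(fit.enet, use.color = TRUE)
title("Elastic Net Path (lambda2 = 1)")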

Lasso and Elastic Net Paths
[Figure: lasso path (left) and elastic net path for λ2 = 1 (right)]

Tuning Elastic Net Models
As with ridge and lasso regression, we need to tune our models for an appropriate choice of penalty parameter; however, in the elastic net we have two penalty parameters to consider. We can think about the parameters we need to tune in several ways. We can tune on (λ1, λ2). We can tune on λ2 together with the L1-norm, taking the second parameter to be either the L1-norm itself (t) or the fraction of the L1-norm (s); the fraction s is the common choice. Alternatively, we can tune on λ2 and the maximum number of steps, which is a good choice if p > n and we want to limit the number of predictors considered.

Tuning Elastic Net Models
We can choose any of these approaches and use cross-validation. For ridge we used a generalized cross-validation approach (like leave-one-out); for the lasso we used K-fold cross-validation (common choices for K are 5 and 10). BUT we need to consider that we now have a 2-D surface to optimize. We could conduct a grid search with K-fold cross-validation: select a set of fixed λ2 values (generally something like c(0, 0.01, 0.1, 1, 10, 100)), apply K-fold cross-validation for each λ2 over a sequence of the second parameter, and select the combination of λ2 and the second parameter that yields the smallest CV error.

Software Packages
There are (at least) 2 packages in R that can fit elastic net models: glmnet and elasticnet. The elasticnet package is based on the lars package we used for fitting lasso models. It has a built-in cross-validation function: set λ2 and then examine the choice of either (a) the fraction of the L1-norm or (b) the maximum number of steps. The glmnet package can also fit an elastic net model; however, it appears to be based on the naïve elastic net approach.
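For comparison, a minimal sketch of a glmnet fit. Note that glmnet parameterizes the penalty by a single lambda and a mixing weight alpha, where alpha = 1 is the lasso and alpha = 0 is ridge (the reverse of the α convention above); the bodyfat2 data frame and its column layout are assumed from the example that follows.

library(glmnet)

# Hedged sketch: assumes bodyfat2 has the response in column 1
# and the 13 predictors in columns 2:14, as in the example below
x <- as.matrix(bodyfat2[, 2:14])   # glmnet requires a numeric matrix
y <- bodyfat2[, 1]

# 10-fold CV over lambda for a fixed mixing weight alpha = 0.5
cv.fit <- cv.glmnet(x, y, alpha = 0.5, nfolds = 10)
coef(cv.fit, s = "lambda.min")     # coefficients at the CV-optimal lambda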

Body Fat Example
Recall our regression model:

> summary(mod13)
Call: lm(formula = PBF ~ ., data = bodyfat2)

        Estimate  Std. Error  t value  Pr(>|t|)
(Int)     0.000   3.241e-02     0.000   1.00000
Age       0.0935  4.871e-02     1.919   0.05618 .
Wt       -0.3106  1.880e-01    -1.652   0.09978 .
Ht       -0.0305  4.202e-02    -0.725   0.46925
Neck     -0.1367  6.753e-02    -2.024   0.04405 *
Chest    -0.0240  9.988e-02    -0.241   0.81000
Abd       1.2302  1.114e-01    11.044   < 2e-16 ***
Hip      -0.1777  1.249e-01    -1.422   0.15622
Thigh     0.1481  9.056e-02     1.636   0.10326
Knee      0.0044  6.974e-02     0.063   0.94970
Ankle     0.0352  4.485e-02     0.786   0.43285
Bicep     0.0656  6.178e-02     1.061   0.28966
Arm       0.1091  4.808e-02     2.270   0.02410 *
Wrist    -0.1808  5.968e-02    -3.030   0.00272 **

Residual standard error: 4.28 on 230 degrees of freedom
Multiple R-squared: 0.7444, Adjusted R-squared: 0.73
F-statistic: 51.54 on 13 and 230 DF, p-value: < 2.2e-16

Body Fat Example
library(elasticnet)

### First conducting 10-fold CV to select our tuning parameters ###
par(mfrow=c(2,3))
set.seed(3210)
menet0   <- cv.enet(x=bodyfat2[,2:14], y=bodyfat2[,1], s=seq(0,1,length=100), lambda=0, mode="fraction")
menet001 <- cv.enet(x=bodyfat2[,2:14], y=bodyfat2[,1], s=seq(0,1,length=100), lambda=0.01, mode="fraction")
menet01  <- cv.enet(x=bodyfat2[,2:14], y=bodyfat2[,1], K=10, s=seq(0,1,length=100), lambda=0.1, mode="fraction")
menet1   <- cv.enet(x=bodyfat2[,2:14], y=bodyfat2[,1], K=10, s=seq(0,1,length=100), lambda=1, mode="fraction")
menet10  <- cv.enet(x=bodyfat2[,2:14], y=bodyfat2[,1], K=10, s=seq(0,1,length=100), lambda=10, mode="fraction")
menet100 <- cv.enet(x=bodyfat2[,2:14], y=bodyfat2[,1], K=10, s=seq(0,1,length=100), lambda=100, mode="fraction")

Body Fat Example
Let's look at one of the models:

> names(menet01)
[1] "s"        "cv"       "cv.error"
> menet01$s
  [1] 0.00000000 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
…
 [97] 0.96969697 0.97979798 0.98989899 1.00000000
> menet01$cv
  [1] 0.9979108 0.9561537 0.9157728 0.8767680 0.8391396 0.8028873 0.7680113
 [99] 0.2876542 0.2879505
> menet01$cv.error
  [1] 0.08570741 0.08403771 0.08243508 0.08088946 0.07939103 0.07793024
 [97] 0.02713857 0.02699345 0.02684893 0.02670769

Body Fat Example
### Looking at minimum CV error for each choice of lambda_2
> min(menet0$cv)
[1] 0.2864688
> min(menet001$cv)
[1] 0.2914921
> min(menet01$cv)
[1] 0.3226485
> min(menet1$cv)
[1] 0.4163464
> min(menet10$cv)
[1] 0.4479238
> min(menet100$cv)
[1] 0.4522492

Body Fat Example
### Selecting the fraction of L1-norm from the model with the smallest CV error
> fracL1 <- menet0$s[which(menet0$cv==min(menet0$cv))]
> fracL1
0.8787879

### Fitting a model with our choice of lambda_2
> mod.enet <- enet(x=bodyfat2[,2:14], y=bodyfat2[,1], lambda=0)
> plot(mod.enet, use.color=T, main="Body Fat Example")
> abline(v=fracL1, lty=2, col=2, lwd=2)

Body Fat Example
### Extracting the model based on the selected lambda_2 and fraction of L1
> predict(mod.enet, s=fracL1, type="coefficients")
$s
[1] 0.8787879

$fraction
        0
0.8787879

$mode
[1] "fraction"

$coefficients
        Age          Wt          Ht        Neck       Chest         Abd
 0.09124856 -0.24657606 -0.03803826 -0.12674220  0.00000000  1.15383642
        Hip       Thigh        Knee       Ankle       Bicep         Arm
-0.13079181  0.10911306  0.00000000  0.01882265  0.04902691  0.09836029
      Wrist
-0.17520256

Summary of Common Penalized Regression
Penalized regression models take the general form of a loss function plus a penalty on the coefficients. The most common choices are the ridge, lasso, and elastic net penalties (sketched below). None are "oracle" methods, BUT they have useful properties.
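The penalty forms on this slide did not transfer; in standard notation the criterion and the three common penalties are

\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + P_\lambda(\beta),
\qquad
P_\lambda(\beta) =
\begin{cases}
\lambda \,\|\beta\|_2^2 & \text{(ridge)} \\[2pt]
\lambda \,\|\beta\|_1 & \text{(lasso)} \\[2pt]
\lambda_2 \,\|\beta\|_2^2 + \lambda_1 \,\|\beta\|_1 & \text{(elastic net).}
\end{cases}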

Summary of Common Penalized Regression
All improve prediction via the bias-variance trade-off: recall the OLS solution is BLUE (best among unbiased solutions), while penalized approaches increase bias but generally reduce variance. In all of these approaches the solutions are not scale invariant, so scale and center the variables (and often the outcome) before fitting; the only exception is categorical responses. The shrinkage/penalty parameters must also be "tuned" before selecting a final model, using generalized cross-validation (relevant only to ridge regression) or K-fold cross-validation.
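A minimal sketch of the standardization step, assuming a hypothetical data frame dat with the response in its first column (that layout is an assumption, not something stated on the slide):

# Hedged sketch: dat is a hypothetical data frame, response in column 1
y <- scale(dat[, 1],  center = TRUE, scale = TRUE)   # center/scale the outcome
X <- scale(dat[, -1], center = TRUE, scale = TRUE)   # center/scale the predictors
# Ridge, lasso, or elastic net fits are then run on the scaled (X, y)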

Adaptations to Penalized Regression Models
Methods designed to achieve oracle properties: choose an Lq penalty with 0 < q < 1 (the constraint region is no longer convex, so it is harder to solve); the Smoothly Clipped Absolute Deviation (SCAD) penalty (Fan and Li 2001); and the use of weights on the coefficients in the penalty, as in the weighted and adaptive lasso (Zou 2006; Zhang and Lu 2007) and the nonnegative garrote. The adaptive lasso penalty is sketched below.
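As a point of reference (not shown on the slide), the adaptive lasso of Zou (2006) replaces the lasso penalty with a weighted version, the weights built from an initial consistent estimate \tilde{\beta} (for example, OLS):

\lambda \sum_{j=1}^{p} w_j \,|\beta_j| ,
\qquad
w_j = \frac{1}{|\tilde{\beta}_j|^{\gamma}}, \;\; \gamma > 0 ,

so coefficients with large initial estimates are penalized less than those with small ones.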

Adaptations to Penalized Regression Models
Methods for correlated predictors: the adaptive elastic net, and group variable selection via the grouped lasso (and adaptive group lasso) and the supnorm penalty.

Next Time
Finalize our discussion of linear regression model fitting strategies, then move to classification (i.e., non-continuous outcomes) and start our discussion of linear classifiers.