Recitation 4 for Big Data, Jay Gu, Feb 7, 2013: LASSO and Coordinate Descent.

Presentation transcript:

Recitation 4 for Big Data: LASSO and Coordinate Descent
Jay Gu, Feb 7, 2013

A numerical example
Generate some synthetic data:
- N = 50, P = 200, number of nonzero coefficients = 5
- X ~ Normal(0, I)
- Nonzero coefficients beta_1, beta_2, beta_3 ~ Normal(1, 2)
- Noise sigma ~ Normal(0, 0.1*I)
- Y = X*beta + sigma
- Split training vs. testing: 80/20
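A minimal sketch of this data-generating setup in Python/NumPy (not the recitation's code; variable names, the seed, and the reading of Normal(1, 2) as mean 1 and standard deviation 2 are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, k = 50, 200, 5                      # samples, features, nonzero coefficients

X = rng.normal(0.0, 1.0, size=(N, P))     # X ~ Normal(0, I)
beta = np.zeros(P)
beta[:k] = rng.normal(1.0, 2.0, size=k)   # nonzero coefficients ~ Normal(1, 2)
noise = rng.normal(0.0, 0.1, size=N)      # sigma ~ Normal(0, 0.1*I)
y = X @ beta + noise                      # Y = X*beta + sigma

n_train = int(0.8 * N)                    # 80/20 train/test split
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```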

Practicalities
- Standardize your data:
  - Center X and Y, so the intercept can be removed.
  - Scale each column of X to unit norm, so regularization is fair across all covariates.
- Warm start: run the lambdas from large to small, starting from the largest useful lambda, lambda_max = max|X'y|, which guarantees a solution with zero support size (all coefficients are zero).
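A sketch of the standardization and warm-start lambda grid described above, continuing from the synthetic X_train, y_train generated earlier (the grid length and spacing are my own choices):

```python
# Center y and the columns of X so the intercept can be dropped,
# then scale each column of X to unit norm for fair regularization.
y_c = y_train - y_train.mean()
X_c = X_train - X_train.mean(axis=0)
X_s = X_c / np.linalg.norm(X_c, axis=0)

# lambda_max = max|X'y|: at this value the lasso solution is exactly zero,
# so the path starts with zero support size.
lam_max = np.max(np.abs(X_s.T @ y_c))

# Lambdas from large to small; each solve is warm-started at the previous solution.
lambdas = np.geomspace(lam_max, 1e-3 * lam_max, num=50)
```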

Algorithm
- Ridge regression: closed-form solution.
- LASSO: iterative algorithms:
  - Subgradient descent
  - Generalized gradient methods (ISTA)
  - Accelerated generalized gradient methods (FISTA)
  - Coordinate descent
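For concreteness, here is a minimal sketch (my own, not the recitation's code) of ISTA for the lasso objective (1/2)||y - Xb||^2 + lambda*||b||_1, with the ridge closed form alongside for comparison; FISTA adds a momentum step on top of the same proximal update:

```python
def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """ISTA (proximal gradient) for (1/2)*||y - X b||^2 + lam*||b||_1."""
    L = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of the smooth gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)            # gradient of the smooth (least squares) part
        b = soft_threshold(b - grad / L, lam / L)
    return b

def ridge(X, y, lam):
    """Ridge regression: closed-form solution (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```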

Subdifferentials and Coordinate Descent
Slides from Ryan Tibshirani:
- F12/slides/06-sg-method.pdf
- F12/slides/25-coord-desc.pdf

Coordinate descent: does it always find the global optimum?
- Convex and differentiable? Yes.
- Convex and non-differentiable? No.

Convex, but with the non-differentiable part separable across coordinates? Yes. Proof:
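The proof on the slide is not captured in the transcript; a sketch of the standard argument, assuming f(x) = g(x) + sum_i h_i(x_i) with g convex and differentiable and each h_i convex:

```latex
% Claim: if x minimizes f(x) = g(x) + \sum_i h_i(x_i) along every coordinate,
% then x is a global minimizer. For any y, convexity of g and separability give
\begin{align*}
f(y) - f(x)
  &\ge \nabla g(x)^\top (y - x) + \sum_i \bigl[h_i(y_i) - h_i(x_i)\bigr] \\
  &= \sum_i \bigl[\nabla_i g(x)\,(y_i - x_i) + h_i(y_i) - h_i(x_i)\bigr] \;\ge\; 0.
\end{align*}
% Each bracketed term is nonnegative: coordinate-wise optimality means
% -\nabla_i g(x) \in \partial h_i(x_i), so h_i(y_i) \ge h_i(x_i) - \nabla_i g(x)(y_i - x_i).
```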

CD for Linear Regression
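The update equations from this slide are not in the transcript. As a sketch: with unit-norm columns, cycling over coordinates and soft-thresholding the partial-residual correlation gives the lasso update b_j <- S_lambda(X_j'(y - X_{-j} b_{-j})). A minimal Python version, reusing soft_threshold and the warm-start grid from the sketches above:

```python
def cd_lasso(X, y, lam, b0=None, n_sweeps=100):
    """Cyclic coordinate descent for (1/2)*||y - X b||^2 + lam*||b||_1.
    Assumes each column of X has unit norm (as after standardization)."""
    n, p = X.shape
    b = np.zeros(p) if b0 is None else b0.copy()
    r = y - X @ b                       # residual, maintained incrementally
    for _ in range(n_sweeps):
        for j in range(p):
            r = r + X[:, j] * b[j]      # partial residual with coordinate j removed
            b[j] = soft_threshold(X[:, j] @ r, lam)
            r = r - X[:, j] * b[j]      # restore full residual with the new b[j]
    return b

# Warm-started path: lambdas run from large (all-zero solution) to small.
path, b = [], np.zeros(X_s.shape[1])
for lam in lambdas:
    b = cd_lasso(X_s, y_c, lam, b0=b)
    path.append(b.copy())
```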

Rate of convergence? Assuming the gradient is Lipschitz continuous:
- Subgradient descent: 1/sqrt(k)
- Gradient descent: 1/k
- Optimal rate for first-order methods: 1/k^2
- Coordinate descent: only known for some special cases

Summary: Coordinate Descent
- Good for large P
- No tuning parameters
- In practice, converges much faster than the optimal first-order methods
- Only applies to certain cases
- Unknown convergence rate for general function classes