# Machine Learning & Data Mining CS/CNS/EE 155, Lecture 3: Regularization, Sparsity & Lasso


## Homework 1

- Check course website!
- Some coding required
- Some plotting required (I recommend Matlab)
- Has supplementary datasets
- Submit via Moodle (due Jan 20th @ 5pm)

## Recap: Complete Pipeline

Training Data → Model Class(es) → Loss Function → Cross Validation & Model Selection → Profit!

## Different Model Classes?

Cross validation & model selection over:
- Option 1: SVMs vs. ANNs vs. LR vs. LS
- Option 2: Regularization

## Notation

- L0 norm: number of non-zero entries
- L1 norm: sum of absolute values
- Squared L2 norm: sum of squares; L2 norm: sqrt(sum of squares)
- L-infinity norm: max absolute value
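A quick sketch of these norms in code (numpy assumed; this example is mine, not from the lecture):

```python
import numpy as np

w = np.array([0.0, 3.0, -4.0, 0.0])

l0 = np.count_nonzero(w)        # L0 "norm": number of non-zero entries
l1 = np.sum(np.abs(w))          # L1 norm: sum of absolute values
l2_sq = np.sum(w ** 2)          # squared L2 norm: sum of squares
l2 = np.sqrt(l2_sq)             # L2 norm: sqrt of sum of squares
linf = np.max(np.abs(w))        # L-infinity norm: max absolute value

print(l0, l1, l2_sq, l2, linf)  # 2 7.0 25.0 5.0 4.0
```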

## Notation Part 2

"Minimizing squared loss" refers to:
- Regression
- Least-squares
- (Unless otherwise stated; e.g., logistic regression uses log loss)

## Ridge Regression

aka L2-regularized regression:

$$\operatorname{argmin}_{w,b} \; \underbrace{\lambda \|w\|_2^2}_{\text{regularization}} + \underbrace{\sum_{i} \left(y_i - w^\top x_i - b\right)^2}_{\text{training loss}}$$

- Trades off model complexity vs. training loss
- Each choice of λ is a "model class"
- Will discuss this further later
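As an illustration (my sketch, not code from the lecture), this objective has the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy (bias omitted), which makes the shrinkage effect of λ easy to see:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge regression (bias omitted): w = (X^T X + lam * I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Tiny synthetic example: noisy linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_small = ridge_fit(X, y, lam=0.1)  # close to w_true
w_large = ridge_fit(X, y, lam=1e4)  # heavily shrunk toward zero
```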

## Example: Predicting Height > 55"

Training data:

| Person | Age>10 | Male? | Height>55" |
|---|---|---|---|
| Alice | 1 | 0 | 1 |
| Bob | 0 | 1 | 0 |
| Carol | 0 | 0 | 0 |
| Dave | 1 | 1 | 1 |
| Erin | 1 | 0 | 1 |
| Frank | 0 | 1 | 1 |
| Gena | 0 | 0 | 0 |
| Harold | 1 | 1 | 1 |
| Irene | 1 | 0 | 0 |
| John | 0 | 1 | 1 |
| Kelly | 1 | 0 | 1 |
| Larry | 1 | 1 | 1 |

Trained (w1, w2, b) for increasingly large λ:

| w1 | w2 | b |
|---|---|---|
| 0.7401 | 0.2441 | -0.1745 |
| 0.7122 | 0.2277 | -0.1967 |
| 0.6197 | 0.1765 | -0.2686 |
| 0.4124 | 0.0817 | -0.4196 |
| 0.1801 | 0.0161 | -0.5686 |
| 0.0001 | 0.0000 | -0.6666 |

As λ grows, the weights shrink toward zero.

## Updated Pipeline

Training Data → Model Class → Loss Function → Cross Validation & Model Selection (choosing λ!) → Profit!

## Model Scores with Increasing λ

| Person | Age>10 | Male? | Height>55" | Scores (λ increasing →) |
|---|---|---|---|---|
| Alice | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.75, 0.67 |
| Bob | 0 | 1 | 0 | 0.42, 0.45, 0.50, 0.58, 0.67 |
| Carol | 0 | 0 | 0 | 0.17, 0.26, 0.42, 0.50, 0.67 |
| Dave | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67 |
| Erin | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67 |
| Frank | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67 |
| Gena | 0 | 0 | 0 | 0.17, 0.27, 0.42, 0.50, 0.67 |
| Harold | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67 |
| Irene | 1 | 0 | 0 | 0.91, 0.89, 0.83, 0.79, 0.67 |
| John | 0 | 1 | 1 | 0.42, 0.45, 0.50, 0.54, 0.67 |
| Kelly | 1 | 0 | 1 | 0.91, 0.89, 0.83, 0.79, 0.67 |
| Larry | 1 | 1 | 1 | 1.16, 1.06, 0.91, 0.83, 0.67 |

With increasing λ, every score converges to the label mean (0.67); the best test error occurs at an intermediate λ.

## Choice of λ Depends on Training Size

Example: 25-dimensional space, with a randomly generated linear response function plus noise.

## Recap: Ridge Regularization

- Ridge regression = L2-regularized least-squares
- Large λ → more stable predictions
  - Less likely to overfit to training data
  - Too large λ → underfit
- Works with other losses (hinge loss, log loss, etc.)

## Model Class Interpretation

Regularized training is not a model class, at least not in the sense we've discussed so far; it is an optimization procedure. Is there a connection?

## Norm-Constrained Model Class

Visualization of the norm-constrained regions for c = 1, 2, 3. The constraint level c seems to correspond to λ…

## Lagrange Multipliers

See http://en.wikipedia.org/wiki/Lagrange_multiplier. (Omitting b and using 1 training example for simplicity.)

## Norm-Constrained Model Class

Training (omitting b and using 1 training example for simplicity):

$$\operatorname{argmin}_{w} \; (y - w^\top x)^2 \quad \text{s.t.} \quad \|w\|_2^2 \le c$$

Lagrangian:

$$\Lambda(w, \lambda) = (y - w^\top x)^2 + \lambda\left(\|w\|_2^2 - c\right)$$

Claim: solving the Lagrangian solves the norm-constrained training problem. Two conditions must be satisfied at optimality:

1. Stationarity: at the Lagrangian's optimum, $\nabla_w \Lambda = 0$, so the first condition is satisfied.
2. Complementary slackness: $\lambda\left(\|w\|_2^2 - c\right) = 0$, so the second condition is satisfied.

See http://en.wikipedia.org/wiki/Lagrange_multiplier.

## Lagrangian = L2-Regularized Training

Norm-constrained training:

$$\operatorname{argmin}_{w} \; (y - w^\top x)^2 \quad \text{s.t.} \quad \|w\|_2^2 \le c$$

L2-regularized training:

$$\operatorname{argmin}_{w} \; \lambda \|w\|_2^2 + (y - w^\top x)^2$$

Hold λ fixed in the Lagrangian: the constant term $-\lambda c$ does not affect the argmin, so L2-regularized training is equivalent to solving the norm-constrained problem, for some c. (Omitting b and using 1 training example for simplicity; see http://en.wikipedia.org/wiki/Lagrange_multiplier.)

## Recap #2: Ridge Regularization

- Ridge regression: L2-regularized least-squares = norm-constrained model
- Large λ → more stable predictions
  - Less likely to overfit to training data
  - Too large λ → underfit
- Works with other losses (hinge loss, log loss, etc.)

## Hallucinating Data Points

What if we instead hallucinate D extra data points: for each dimension d, add the point $x = \sqrt{\lambda}\, e_d$ (the unit vector along the d-th dimension) with target $y = 0$? The extra squared-loss terms contribute $\sum_d (\sqrt{\lambda}\, w_d)^2 = \lambda \|w\|_2^2$, which is identical to regularization! (Omitting b for simplicity.)
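A numerical check of this equivalence (my sketch; numpy and the synthetic data are assumptions, not from the lecture):

```python
import numpy as np

# Ridge regression equals plain least-squares on data augmented with D
# "hallucinated" points sqrt(lam) * e_d (unit vectors) with target 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 2.5

# Ridge closed form (bias omitted).
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Plain least-squares on the augmented dataset.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(4)])
y_aug = np.concatenate([y, np.zeros(4)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_aug))  # True
```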

## Extension: Multi-task Learning

Two prediction tasks: a spam filter for Alice and a spam filter for Bob. Limited training data for both… but Alice is similar to Bob.

## Option 1: Train Separately

Two training sets, each with N relatively small. Training each model only on its own data: both models have high error. (Omitting b for simplicity.)

## Option 2: Train Jointly

Two training sets, each with N relatively small. Minimizing the sum of both training losses doesn't accomplish anything: w and v don't depend on each other, so the joint optimum is the same as training separately. (Omitting b for simplicity.)

## Multi-task Regularization

Add a term that prefers w and v to be "close", controlled by γ (one standard form of the objective):

$$\operatorname{argmin}_{w,v} \; \lambda\left(\|w\|_2^2 + \|v\|_2^2\right) + \gamma \|w - v\|_2^2 + \text{(training loss of both tasks)}$$

- Tasks similar: larger γ helps!
- Tasks not identical: γ should not be too large

(The slide plots training loss and test loss on task 2 for standard vs. multi-task regularization.)
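One way to sketch this (the objective form and data are my assumptions, not code from the lecture): the joint objective is quadratic, so setting its gradient to zero gives a single linear system in (w, v):

```python
import numpy as np

def multitask_fit(Xw, yw, Xv, yv, lam, gamma):
    # Joint objective: ||Xw w - yw||^2 + ||Xv v - yv||^2
    #                  + lam * (||w||^2 + ||v||^2) + gamma * ||w - v||^2.
    # Quadratic, so stationarity yields one (2d x 2d) linear system.
    d = Xw.shape[1]
    I = np.eye(d)
    A = np.block([
        [Xw.T @ Xw + (lam + gamma) * I, -gamma * I],
        [-gamma * I, Xv.T @ Xv + (lam + gamma) * I],
    ])
    b = np.concatenate([Xw.T @ yw, Xv.T @ yv])
    wv = np.linalg.solve(A, b)
    return wv[:d], wv[d:]

rng = np.random.default_rng(2)
Xw = rng.normal(size=(5, 3)); yw = rng.normal(size=5)
Xv = rng.normal(size=(5, 3)); yv = rng.normal(size=5)

# gamma = 0 decouples the tasks: identical to separate ridge regression.
w0, v0 = multitask_fit(Xw, yw, Xv, yv, lam=1.0, gamma=0.0)
w_sep = np.linalg.solve(Xw.T @ Xw + 1.0 * np.eye(3), Xw.T @ yw)

# Very large gamma forces the two weight vectors together.
w9, v9 = multitask_fit(Xw, yw, Xv, yv, lam=1.0, gamma=1e6)
```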

## Lasso: L1-Regularized Least-Squares

$$\operatorname{argmin}_{w} \; \lambda \|w\|_1 + \sum_{i} \left(y_i - w^\top x_i\right)^2$$

## L1 vs. L2 Regularized Least-Squares

Compare the penalties (omitting b for simplicity): the L2 penalty is $\lambda \|w\|_2^2$, with gradient $2\lambda w$; the L1 penalty is $\lambda \|w\|_1 = \lambda \sum_d |w_d|$, whose (sub)gradient is $\lambda\,\operatorname{sign}(w_d)$ away from zero.

## Subgradient (Sub-differential)

Where the absolute value is differentiable (w ≠ 0), its derivative is sign(w). At w = 0 it is not differentiable, and the subdifferential is the continuous range [-1, 1]. (Omitting b for simplicity.)
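A concrete consequence (my sketch, not from the slides): for the one-dimensional problem argmin_w (y - w)² + λ|w|, setting the subgradient to zero yields the soft-thresholding operator, which zeroes out small inputs:

```python
import numpy as np

def soft_threshold_1d(y, lam):
    # Exact minimizer of (y - w)^2 + lam * |w|, derived from the subgradient
    # optimality condition: w = sign(y) * max(|y| - lam/2, 0).
    return np.sign(y) * max(abs(y) - lam / 2.0, 0.0)

print(soft_threshold_1d(3.0, 2.0))   # 2.0
print(soft_threshold_1d(0.4, 2.0))   # 0.0  (small inputs are zeroed out: sparsity)
print(soft_threshold_1d(-3.0, 2.0))  # -2.0
```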

## Lagrange Multipliers with the L1 Constraint

With the L1 norm, the constraint region has corners, and solutions tend to be at the corners, where some coordinates are exactly zero. (Omitting b and using 1 training example for simplicity; see http://en.wikipedia.org/wiki/Lagrange_multiplier.)

## Sparsity

- w is sparse if it is mostly 0's (small L0 norm)
- Why not L0 regularization? Not continuous!
- L1 induces sparsity, and is continuous!

(Omitting b for simplicity.)
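To see L1-induced sparsity numerically, here is a minimal proximal-gradient (ISTA) solver for the lasso; the solver and synthetic data are my own sketch, not from the lecture:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrinks toward 0, exactly zeroing small entries.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, iters=500):
    # Proximal gradient descent for argmin_w ||Xw - y||^2 + lam * ||w||_1.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0])  # truly sparse weights
y = X @ w_true + 0.05 * rng.normal(size=100)

w_lasso = lasso_ista(X, y, lam=10.0)  # non-zero (roughly w_true) only in dims 0 and 3
```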

## Why Is Sparsity Important?

- Computational / memory efficiency
  - Instead of storing 1M numbers in an array, store two numbers per non-zero entry: (index, value) pairs, e.g., [ (50,1), (51,1) ]
  - Dot products become more efficient
- Sometimes the true w is sparse
  - Want to recover the non-zero dimensions
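A minimal sketch of the (index, value) representation and its dot product (my example, not from the lecture):

```python
def sparse_dot(sparse_w, x):
    # Only touches the non-zero entries of the sparse vector.
    return sum(value * x[index] for index, value in sparse_w)

w = [(50, 1.0), (51, 1.0)]  # 2 non-zeros out of 1M dimensions, as (index, value) pairs
x = [0.0] * 1_000_000
x[50], x[51] = 3.0, 4.0

result = sparse_dot(w, x)   # 7.0, using two multiplications instead of a million
```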

## Lasso Guarantee

Suppose the data are generated from a sparse linear model plus noise. Then, if λ is chosen appropriately, with high probability (increasing with N) the Lasso recovers the true parameters with high precision (and sometimes high recall of the non-zero dimensions).

See also:
- https://www.cs.utexas.edu/~pradeepr/courses/395T-LT/filez/highdimII.pdf
- http://www.eecs.berkeley.edu/~wainwrig/Papers/Wai_SparseInfo09.pdf

## Recap: Lasso vs. Ridge

- Model assumptions: Lasso learns a sparse weight vector
- Predictive accuracy: Lasso is often not as accurate; re-run least-squares on the dimensions selected by Lasso
- Ease of inspection: sparse w's are easier to inspect
- Ease of optimization: Lasso is somewhat trickier to optimize
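The "re-run least-squares on the selected dimensions" idea can be sketched as follows (the helper and data are hypothetical, not from the lecture):

```python
import numpy as np

def refit_on_support(X, y, w_sparse):
    # Unregularized least-squares restricted to the non-zero dims of w_sparse,
    # removing the shrinkage bias that L1 regularization introduces.
    support = np.flatnonzero(w_sparse)
    sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    w_refit = np.zeros_like(w_sparse)
    w_refit[support] = sol
    return w_refit

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0])
y = X @ w_true + 0.05 * rng.normal(size=100)

# Suppose Lasso selected the right dimensions but shrank their values.
w_lasso = np.array([1.8, 0.0, 0.0, -2.7, 0.0])
w_final = refit_on_support(X, y, w_lasso)  # close to w_true on the support
```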

## Recap: Regularization

- L2 (ridge)
- L1 (Lasso)
- Multi-task
- [Insert yours here!]

(Omitting b for simplicity.)

## Next Lecture: Recent Applications of Lasso

- Cancer detection
- Personalization via Twitter

Recitation on Wednesday: Probability & Statistics.

Image sources:
- http://statweb.stanford.edu/~tibs/ftp/canc.pdf
- https://dl.dropboxusercontent.com/u/16830382/papers/badgepaper-kdd2013.pdf
