Slide 1: Machine Learning & Data Mining (CS/CNS/EE 155)
Lecture 3: Regularization, Sparsity & Lasso

Slide 2: Homework 1
- Check the course website!
- Some coding required
- Some plotting required (I recommend Matlab)
- Has supplementary datasets
- Submit via Moodle (due Jan 20th @ 5pm)

Slide 3: Recap: Complete Pipeline
Training Data → Model Class(es) → Loss Function → Cross Validation & Model Selection → Profit!

Slide 4: Different Model Classes?
- What model classes should cross validation & model selection choose among?
- Option 1: SVMs vs ANNs vs Logistic Regression (LR) vs Least Squares (LS)
- Option 2: Regularization

Slide 5: Notation
- L0 norm: ||w||₀ = number of non-zero entries
- L1 norm: ||w||₁ = sum of absolute values
- L2 norm: ||w||₂ = sqrt(sum of squares); squared L2 norm: ||w||₂² = sum of squares
- L-infinity norm: ||w||∞ = max absolute value
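To make the definitions concrete, here is a minimal NumPy sketch (my example, not from the slides) computing all four norms:

```python
import numpy as np

w = np.array([0.0, 3.0, -4.0, 0.0])

l0    = np.count_nonzero(w)        # number of non-zero entries -> 2
l1    = np.sum(np.abs(w))          # sum of absolute values     -> 7.0
l2    = np.sqrt(np.sum(w ** 2))    # sqrt of sum of squares     -> 5.0
l2_sq = np.sum(w ** 2)             # squared L2 norm            -> 25.0
linf  = np.max(np.abs(w))          # max absolute value         -> 4.0

print(l0, l1, l2, l2_sq, linf)
```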

Slide 6: Notation, Part 2
- Minimizing squared loss
  - Regression
  - Least-squares
  - (unless otherwise stated)
- E.g., logistic regression minimizes log loss instead
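For reference, the two losses named here, written out explicitly (my transcription of the standard definitions):

```latex
% Squared loss (least squares), over training set S = \{(x_i, y_i)\}:
L_{\mathrm{sq}}(w) = \sum_i \bigl(y_i - w^\top x_i\bigr)^2

% Log loss (logistic regression), with labels y_i \in \{-1, +1\}:
L_{\mathrm{log}}(w) = \sum_i \ln\!\bigl(1 + e^{-y_i\, w^\top x_i}\bigr)
```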

Slide 7: Ridge Regression (aka L2-Regularized Regression)
- Objective: argmin_w λ||w||₂² + Σ_i (y_i − w^T x_i)², i.e., regularization term plus training loss
- Trades off model complexity vs training loss
- Each choice of λ is a "model class"
  - Will discuss this further later
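As a sketch of what this objective computes, ridge regression has a closed-form solution, shown below in NumPy (the toy data and function name are mine):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical toy data: y is roughly 2*x1 - 1*x2 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

for lam in [0.0, 1.0, 100.0]:
    print(lam, ridge_fit(X, y, lam))   # larger lam shrinks w toward 0
```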

Slide 8: Example: predicting Height > 55" (train/test split shown on the slide)

Person   Age>10  Male?  Height>55"
Alice       1      0        1
Bob         0      1        0
Carol       0      0        0
Dave        1      1        1
Erin        1      0        1
Frank       0      1        1
Gena        0      0        0
Harold      1      1        1
Irene       1      0        0
John        0      1        1
Kelly       1      0        1
Larry       1      1        1

Learned weights w = (w1, w2) and bias b as λ increases:

   w1       w2        b
 0.7401   0.2441   -0.1745
 0.7122   0.2277   -0.1967
 0.6197   0.1765   -0.2686
 0.4124   0.0817   -0.4196
 0.1801   0.0161   -0.5686
 0.0001   0.0000   -0.6666   (larger λ drives w toward 0)

Slide 9: Updated Pipeline
Training Data → Model Class → Loss Function → Cross Validation & Model Selection (choosing λ!) → Profit!

Slide 10: Model scores with increasing λ (train/test split shown on the slide)

Person   Age>10  Male?  Height>55"   Score (λ increasing →)
Alice       1      0        1        0.91  0.89  0.83  0.75  0.67
Bob         0      1        0        0.42  0.45  0.50  0.58  0.67
Carol       0      0        0        0.17  0.26  0.42  0.50  0.67
Dave        1      1        1        1.16  1.06  0.91  0.83  0.67
Erin        1      0        1        0.91  0.89  0.83  0.79  0.67
Frank       0      1        1        0.42  0.45  0.50  0.54  0.67
Gena        0      0        0        0.17  0.27  0.42  0.50  0.67
Harold      1      1        1        1.16  1.06  0.91  0.83  0.67
Irene       1      0        0        0.91  0.89  0.83  0.79  0.67
John        0      1        1        0.42  0.45  0.50  0.54  0.67
Kelly       1      0        1        0.91  0.89  0.83  0.79  0.67
Larry       1      1        1        1.16  1.06  0.91  0.83  0.67

As λ grows, all scores collapse toward a constant (0.67); the best test error occurs at an intermediate λ.

Slide 11: [Figure] Choice of λ depends on training size
- Setup: 25-dimensional space, randomly generated linear response function plus noise

Slide 12: Recap: Ridge Regularization
- Ridge regression = L2-regularized least-squares
- Larger λ → more stable predictions
  - Less likely to overfit the training data
  - Too large a λ → underfitting
- Works with other losses too
  - Hinge loss, log loss, etc.

Slide 13: Model Class Interpretation
- This is not a model class!
  - At least not in the sense we've discussed so far...
- It is an optimization procedure
  - Is there a connection?

Slide 14: Norm-Constrained Model Class
- Train subject to a norm constraint: argmin_w Σ_i (y_i − w^T x_i)²  s.t.  ||w||₂² ≤ c
- Visualization on slide for c = 1, 2, 3: nested constraint regions
- Seems to correspond to λ...

Slide 15: Lagrange Multipliers
- http://en.wikipedia.org/wiki/Lagrange_multiplier
- (Omitting b & using 1 training example for simplicity)

Slide 16: Norm-Constrained Model Class
- Training: argmin_w L(w) subject to ||w||₂² ≤ c
- Claim: solving the Lagrangian solves the norm-constrained training problem
- Lagrangian: Λ(w, λ) = L(w) + λ(||w||₂² − c)
- Two conditions must be satisfied at optimality: stationarity (∇_w Λ = 0) and feasibility of the constraint
- Observation about optimality: a minimizer of the Lagrangian has ∇_w Λ = 0, which satisfies the first condition!
- (Omitting b & using 1 training example for simplicity; see http://en.wikipedia.org/wiki/Lagrange_multiplier)

Slide 17: Norm-Constrained Model Class (continued)
- Same setup as Slide 16: Lagrangian Λ(w, λ) = L(w) + λ(||w||₂² − c)
- A minimizer of the Lagrangian satisfies ∇_w Λ = 0: the first condition!
- Optimality implication of the Lagrangian: complementary slackness, λ(||w||₂² − c) = 0, so the constraint holds: satisfies the 2nd condition!
- (Omitting b & using 1 training example for simplicity; see http://en.wikipedia.org/wiki/Lagrange_multiplier)

Slide 18: Lagrangian = L2-Regularized Training
- Norm-constrained training: argmin_w L(w) subject to ||w||₂² ≤ c
- L2-regularized training: argmin_w L(w) + λ||w||₂²
- Lagrangian = norm-constrained training (previous two slides)
- Lagrangian = L2-regularized training:
  - Hold λ fixed; the λc term is a constant, so minimizing Λ(w, λ) over w is exactly L2-regularized training
  - Hence L2-regularized training is equivalent to solving the norm-constrained problem, for some c!
- (Omitting b & using 1 training example for simplicity; see http://en.wikipedia.org/wiki/Lagrange_multiplier)
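In symbols, a compact version of the equivalence (my transcription of the standard argument):

```latex
\Lambda(w, \lambda) \;=\; L(w) + \lambda\bigl(\|w\|_2^2 - c\bigr)
\quad\Longrightarrow\quad
\arg\min_w \Lambda(w, \lambda) \;=\; \arg\min_w\; L(w) + \lambda \|w\|_2^2
```

The two argmins coincide because the −λc term does not depend on w.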

Slide 19: Recap #2: Ridge Regularization
- Ridge regression: L2-regularized least-squares = norm-constrained model
- Larger λ → more stable predictions
  - Less likely to overfit the training data
  - Too large a λ → underfitting
- Works with other losses too
  - Hinge loss, log loss, etc.

Slide 20: Hallucinating Data Points
- What if we instead hallucinate D extra data points?
- For each dimension d, add the point x = √λ · e_d with target y = 0, where e_d is the unit vector along the d-th dimension
- The hallucinated points contribute squared loss Σ_d (0 − √λ w_d)² = λ||w||₂²: identical to regularization!
- (Omitting b for simplicity)
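A quick NumPy check of that claim (construction as described above; the data and variable names are mine): plain least squares on the augmented data reproduces the ridge solution exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.5

# Ridge closed form: w = (X^T X + lam * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Hallucinate D = 5 extra points: sqrt(lam) * e_d, each with target 0,
# then solve plain (unregularized) least squares on the augmented data.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(5)])
y_aug = np.concatenate([y, np.zeros(5)])
w_hallucinated, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_hallucinated))  # True: identical to regularization
```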

Slide 21: Extension: Multi-task Learning
- 2 prediction tasks:
  - Spam filter for Alice
  - Spam filter for Bob
- Limited training data for both...
  - ...but Alice is similar to Bob

Slide 22: Extension: Multi-task Learning
- Two training sets, each with relatively small N
- Option 1: train the two models separately
- Result: both models have high error
- (Omitting b for simplicity)

Slide 23: Extension: Multi-task Learning
- Two training sets, each with relatively small N
- Option 2: train jointly (sum both training losses in one objective)
- This doesn't accomplish anything! (w & v don't depend on each other)
- (Omitting b for simplicity)

Slide 24: Multi-task Regularization
- Objective = standard regularization + multi-task regularization + training loss
- Prefer w & v to be "close", controlled by γ
  - If the tasks are similar, a larger γ helps!
  - If the tasks are not identical, γ should not be too large
- (Plot on slide: test loss on task 2)
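One standard way to write the objective this slide labels term-by-term (my reconstruction; the second task's training set is written with tildes):

```latex
\min_{w,\,v}\;
\underbrace{\lambda\bigl(\|w\|_2^2 + \|v\|_2^2\bigr)}_{\text{standard regularization}}
\;+\;
\underbrace{\gamma\,\|w - v\|_2^2}_{\text{multi-task regularization}}
\;+\;
\underbrace{\sum_i \bigl(y_i - w^\top x_i\bigr)^2
          + \sum_j \bigl(\tilde{y}_j - v^\top \tilde{x}_j\bigr)^2}_{\text{training loss}}
```

The γ term is what couples w and v: with γ = 0 this reduces to Slide 23's joint training, and as γ → ∞ it forces w = v.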

Slide 25: Lasso
- L1-Regularized Least-Squares

Slide 26: L1-Regularized Least Squares
- L2 (Ridge): argmin_w λ||w||₂² + Σ_i (y_i − w^T x_i)²
- L1 (Lasso): argmin_w λ||w||₁ + Σ_i (y_i − w^T x_i)²
- (Omitting b for simplicity)

Slide 27: Subgradient (Sub-differential)
- |w| is differentiable everywhere except w = 0: the derivative is +1 for w > 0 and −1 for w < 0
- At w = 0 there is a continuous range of valid slopes!
- (Omitting b for simplicity)
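Written out, the subdifferential of the absolute value (standard result, matching the slide's point):

```latex
\partial_w |w| \;=\;
\begin{cases}
\{+1\}    & w > 0 \\
[-1,\,+1] & w = 0 \\
\{-1\}    & w < 0
\end{cases}
```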

Slide 28: L1-Regularized Least Squares
- L2: the regularizer's gradient 2λw shrinks every weight in proportion to its size, so weights rarely land at exactly 0
- L1: the regularizer's subgradient λ·sign(w_d) applies a constant pull toward 0, and the [−λ, +λ] range at w_d = 0 can hold a weight at exactly 0
- (Omitting b for simplicity)

Slide 29: Lagrange Multipliers (L1)
- Same constrained-optimization picture as before, but with the L1 constraint region
- Solutions tend to be at corners!
- (Omitting b & using 1 training example for simplicity; see http://en.wikipedia.org/wiki/Lagrange_multiplier)

Slide 30: Sparsity
- w is sparse if it is mostly 0's
  - Small L0 norm
- Why not L0 regularization?
  - Not continuous!
- L1 induces sparsity
  - And is continuous!
- (Omitting b for simplicity)

Slide 31: Why is Sparsity Important?
- Computational / memory efficiency
  - Instead of storing 1M numbers in a dense array, store 2 numbers per non-zero entry: (index, value) pairs, e.g., [(50, 1), (51, 1)]
  - Dot products become more efficient (see the sketch below)
- Sometimes the true w is sparse
  - Want to recover the non-zero dimensions
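A minimal sketch of the (index, value) representation and the cheaper dot product (the helper name and data are mine):

```python
import numpy as np

def sparse_dot(w_sparse, x):
    """Dot product that touches only w's non-zero entries."""
    return sum(value * x[index] for index, value in w_sparse)

# Dense x with 1M entries; w stored as (index, value) pairs instead.
x = np.zeros(1_000_000)
x[50], x[51] = 2.0, 3.0
w_sparse = [(50, 1.0), (51, 1.0)]

print(sparse_dot(w_sparse, x))  # 5.0, using 2 multiplies instead of 1,000,000
```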

Slide 32: Lasso Guarantee
- Suppose the data are generated by a sparse linear model plus noise
- Then, under suitable conditions, with high probability (increasing with N), Lasso recovers the non-zero dimensions:
  - High precision: recovered parameters are truly non-zero
  - Sometimes high recall: all truly non-zero dimensions are recovered
- See also:
  - https://www.cs.utexas.edu/~pradeepr/courses/395T-LT/filez/highdimII.pdf
  - http://www.eecs.berkeley.edu/~wainwrig/Papers/Wai_SparseInfo09.pdf

Slide 33: (same data as Slide 8)

Person   Age>10  Male?  Height>55"
Alice       1      0        1
Bob         0      1        0
Carol       0      0        0
Dave        1      1        1
Erin        1      0        1
Frank       0      1        1
Gena        0      0        0
Harold      1      1        1
Irene       1      0        0
John        0      1        1
Kelly       1      0        1
Larry       1      1        1

Slide 34: Recap: Lasso vs Ridge
- Model assumptions
  - Lasso learns a sparse weight vector
- Predictive accuracy
  - Lasso is often not as accurate
  - Fix: re-run least squares on the dimensions selected by Lasso
- Ease of inspection
  - Sparse w's are easier to inspect
- Ease of optimization
  - Lasso is somewhat trickier to optimize
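A sketch of that "re-run least squares on the Lasso-selected dimensions" idea, assuming scikit-learn's Lasso is available (the toy data, alpha value, and names are mine):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 25))
w_true = np.zeros(25)
w_true[[3, 7]] = [2.0, -1.5]                # sparse ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=100)

# Step 1: Lasso selects a sparse set of dimensions.
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)

# Step 2: re-run unregularized least squares on just those dimensions,
# undoing the shrinkage bias Lasso puts on the surviving weights.
w_refit, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
print(selected, w_refit)
```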

Slide 35: Recap: Regularization
- L2 (Ridge)
- L1 (Lasso)
- Multi-task
- [Insert yours here!]
- (Omitting b for simplicity)

Slide 36: Next Lecture: Recent Applications of Lasso
- Cancer detection
- Personalization via Twitter
- Recitation on Wednesday: Probability & Statistics
- Image sources:
  - http://statweb.stanford.edu/~tibs/ftp/canc.pdf
  - https://dl.dropboxusercontent.com/u/16830382/papers/badgepaper-kdd2013.pdf
