1 Sparse and low-rank recovery problems in signal processing and machine learning
Jeremy Watt and Aggelos Katsaggelos Northwestern University Department of EECS

2 Part 3: Accelerated Proximal Gradient Methods

3 Why learn this?
More widely applicable than greedy methods (narrow in problem type, broad in scale) or smooth reformulations (broad in problem type, narrow in scale)
The “Accelerated” part makes the methods very scalable
The “Proximal” part is a natural extension of the standard gradient descent scheme
Used as a subroutine in primal-dual approaches

4 Contents
The “Accelerated” part: Nesterov’s optimal gradient step is often used, since the functions we deal with are typically convex
The “Proximal” part: standard gradient descent via its proximal definition, and its natural extensions to sparse and low-rank problems

5 The “Accelerated” part

6 Gradient descent algorithm
The reciprocal of the Lipschitz constant of the gradient is typically used as the step length for convex functions; e.g., if f(x) = (1/2)||Ax - b||_2^2, you’ll have L = ||A^T A||_2.
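A minimal sketch of this step rule in Python (assuming the least-squares objective above; the function name and defaults are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(A, b, num_iters=500):
    """Minimize f(x) = 0.5*||Ax - b||^2 with constant step length 1/L."""
    # Lipschitz constant of the gradient: the spectral norm of A^T A
    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)        # gradient of the least-squares objective
        x = x - (1.0 / L) * grad        # gradient step of length 1/L
    return x
```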

7 Gradient steps towards the optimum
in the valley of a long narrow tube

8 Gradient steps towards the optimum
with a “momentum” term added to cancel out the perpendicular “noise” and prevent zig-zagging3,4
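A rough sketch of a gradient step with a momentum term in Python (a heavy-ball style update; the momentum weight beta and the callable grad are illustrative assumptions, not values from the slides):

```python
import numpy as np

def momentum_gradient_descent(grad, x0, step, beta=0.9, num_iters=500):
    """Gradient descent with a momentum term that damps the zig-zagging
    (perpendicular) component of successive gradient steps."""
    x = x0.copy()
    velocity = np.zeros_like(x0)
    for _ in range(num_iters):
        velocity = beta * velocity - step * grad(x)  # accumulate past steps
        x = x + velocity                             # sideways components tend to cancel
    return x
```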

9 standard gradient

10 momentum term

11 evens out sideways “noise”

12 Nesterov’s optimal gradient method4,5
evens out sideways “noise”

13 Nesterov’s optimal gradient method4,5
evens out sideways “noise” and converges at the O(1/k^2) rate rather than the O(1/k) rate of standard gradient descent!

14 Optimal gradient descent algorithm
The gradient of f is often assumed to be Lipschitz continuous, but this isn’t required. There are many variations on this theme.
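A minimal sketch of one common variation of Nesterov’s optimal gradient method (the FISTA-style extrapolation sequence used here is one of the many variants mentioned above, and the names are illustrative):

```python
import numpy as np

def nesterov_gradient_descent(grad, x0, L, num_iters=500):
    """Optimal (accelerated) gradient method: take the 1/L gradient step at an
    extrapolated point y rather than at the current iterate x."""
    x = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(num_iters):
        x_next = y - (1.0 / L) * grad(y)                  # gradient step from the look-ahead point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0  # extrapolation weight recursion
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # "momentum" / extrapolation step
        x, t = x_next, t_next
    return x
```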

15 The “Accelerated” piece of the Proximal Gradient Method we’ll see next1
Replacing the standard gradient step in the proximal methods discussed next with the accelerated step makes them “Accelerated.” We’ll stick with the standard gradient step for ease of exposition in the introduction to proximal methods. Replacing the gradient step in each of the final Proximal Gradient approaches gives the corresponding Accelerated Proximal Gradient approach.

16 The “Proximal” part

17 Projection onto a convex set
This will become a familiar shape
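The formula on this slide is an image in the original deck; the standard definition of the projection onto a convex set C, filled in here for reference, is

```latex
P_C(y) \;=\; \arg\min_{x \in C}\; \tfrac{1}{2}\,\|x - y\|_2^2 .
```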

18 Gradient step: proximal definition
A second-order “almost” Taylor series expansion: the Hessian is replaced by the Lipschitz constant L times the identity.

19 Gradient step: proximal definition
a little rearranging. To go from the first line to the second, just throw away terms independent of x and complete the square.
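The missing formulas, reconstructed in standard notation (x^k is the current iterate and y the point reached by a standard gradient step; this is the usual derivation rather than a verbatim copy of the slide):

```latex
\begin{aligned}
x^{k+1} &= \arg\min_{x}\; f(x^k) + \nabla f(x^k)^{T}(x - x^k) + \tfrac{L}{2}\,\|x - x^k\|_2^2 \\
        &= \arg\min_{x}\; \tfrac{L}{2}\,\|x - y\|_2^2,
\qquad \text{where } y = x^k - \tfrac{1}{L}\,\nabla f(x^k).
\end{aligned}
```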

20 Proximal Gradient step
a simple projection! minimized at the projection of y onto the constraint set

21 Proximal Gradient step
a simple projection! again notice the familiar shape
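Putting the last two slides together in standard notation (a reconstruction, with C the constraint set of the problem):

```latex
x^{k+1} \;=\; \arg\min_{x \in C}\; \tfrac{L}{2}\,\|x - y\|_2^2
        \;=\; P_C\!\Big( x^k - \tfrac{1}{L}\,\nabla f(x^k) \Big).
```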

22 Extension to L-1 regularized problems

23 Convex L-1 regularized problems
e.g. the Lasso
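Written out in standard form (the slide’s formula is an image; λ denotes the regularization weight):

```latex
\min_{x}\; f(x) + \lambda\,\|x\|_1,
\qquad \text{e.g. the Lasso:}\quad \min_{x}\; \tfrac{1}{2}\,\|Ax - b\|_2^2 + \lambda\,\|x\|_1 .
```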

24 Proximal Gradient step
It’s not clear how to generalize the notion of a gradient step to the L-1 case from the standard perspective; from the proximal perspective, however, it is fairly straightforward: use the same quadratic approximation to f.

25 Proximal Gradient step
same business as before

26 Proximal Gradient step
same shape as proximal version of projection

27 Proximal Gradient step
at this point, expect the minimizer to be given by applying some simple operator to y (the shrinkage operator, defined next)

28 Shrinkage operator
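The slide’s formula is an image; the standard soft-thresholding (shrinkage) operator it refers to, applied elementwise, is

```latex
\operatorname{shrink}(y, \tau)_i \;=\; \operatorname{sign}(y_i)\,\max\!\big(|y_i| - \tau,\; 0\big).
```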

29 Proximal Gradient step
same business as before

30 Proximal Gradient Algorithm for general L-1 Regularized problem
Complexity: just like gradient descent.

31 Iterative Shrinkage Thresholding Algorithm (ISTA)1
Complexity: just like gradient descent.

32 Iterative Shrinkage Thresholding Algorithm (ISTA)
With the optimal gradient step, this is known as the Fast Iterative Shrinkage Thresholding Algorithm (FISTA)1. Complexity: just like gradient descent.
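A minimal ISTA sketch in Python for the Lasso-type problem above (illustrative names, not code from the slides; swapping the plain iterate for the Nesterov extrapolation shown earlier gives FISTA):

```python
import numpy as np

def shrink(y, tau):
    """Elementwise soft-thresholding (shrinkage) operator."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def ista(A, b, lam, num_iters=500):
    """Proximal gradient (ISTA) for 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)               # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        y = x - (1.0 / L) * (A.T @ (A @ x - b))  # standard gradient step on the smooth part
        x = shrink(y, lam / L)                   # proximal step: soft-threshold at lam/L
    return x
```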

33 Extension to nuclear-norm regularized problems

34 Problem archetype proximal gradient
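The archetype written out in standard form (a reconstruction; the slide’s formula is an image):

```latex
\min_{X}\; f(X) + \lambda\,\|X\|_{*},
\qquad \|X\|_{*} = \sum_i \sigma_i(X) \ \ \text{(the sum of the singular values of } X\text{)}.
```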

35 Problem archetype same quadratic approximation to f

36 Problem archetype same shape as proximal version of projection

37 Problem archetype As usual, expect the minimizer X* to be given by applying a simple operator to Y (worked out next)

38 What is X*? Well, if the SVD can be written in outer-product form as

39 What is X*? then since and

40 What is X*? Could it, in analogy to ISTA, be given by soft-thresholding the singular values of Y? i.e.

41 What is X*? Could it, in analogy to ISTA, be given by soft-thresholding the singular values of Y? i.e. (since the nuclear norm is the L-1 norm of the singular values)

42 Yes, it is2
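A minimal singular value thresholding sketch in Python (illustrative, in the spirit of the operator from reference 2; the function name is mine):

```python
import numpy as np

def singular_value_threshold(Y, tau):
    """Soft-threshold the singular values of Y (the prox of tau*||.||_*)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # shrink each singular value toward zero
    return (U * s_shrunk) @ Vt            # rebuild the matrix from the thresholded SVD
```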

43 Example: RPCA w/no Gaussian noise
Moral to continue to drive home: reformulating is half the battle in optimization. This is an example use of the Quadratic Penalty Method

44 Example: RPCA w/no Gaussian noise
Quadratic penalty method

45 RPCA: reformulation via Quadratic Penalty Method
Perform alternating minimization in X and E using proximal gradient steps.
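A rough sketch of this alternating scheme in Python, for the quadratic-penalty formulation min_{X,E} ||X||_* + lam*||E||_1 + (mu/2)*||D - X - E||_F^2 (the formulation and parameter names are my reconstruction of the slide’s missing equations, not a verbatim copy):

```python
import numpy as np

def shrink(Y, tau):
    """Elementwise soft-thresholding."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def singular_value_threshold(Y, tau):
    """Soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def rpca_quadratic_penalty(D, lam, mu, num_iters=200):
    """Alternate proximal gradient steps in X and E on
    ||X||_* + lam*||E||_1 + (mu/2)*||D - X - E||_F^2."""
    X = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(num_iters):
        # X step: the smooth penalty has gradient -mu*(D - X - E) and Lipschitz constant mu,
        # so a 1/mu gradient step followed by the nuclear-norm prox is an SVT at threshold 1/mu.
        GX = -mu * (D - X - E)
        X = singular_value_threshold(X - GX / mu, 1.0 / mu)
        # E step: same gradient step in E, followed by the L-1 prox (shrinkage at lam/mu).
        GE = -mu * (D - X - E)
        E = shrink(E - GE / mu, lam / mu)
    return X, E
```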

46 Both of these are in the bag

47 Demo: PG versus APG

48 Where to go from here Primal-dual methods
Top-of-the-line algorithms in the field for a wide array of large-scale sparse/low-rank problems
Often employ proximal methods
Rapid convergence to “reasonable” solutions, often good enough for sparse/low-rank problems
Dual ascent7, Augmented Lagrangian8 / Alternating Direction Method of Multipliers9

References
1. Beck, Amir, and Marc Teboulle. "A fast iterative shrinkage-thresholding algorithm for linear inverse problems." SIAM Journal on Imaging Sciences 2.1 (2009).
2. Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. "A singular value thresholding algorithm for matrix completion." SIAM Journal on Optimization 20.4 (2010).
3. Qian, Ning. "On the momentum term in gradient descent learning algorithms." Neural Networks 12.1 (1999).
4. Candès, Emmanuel J. Class notes for "Math 301: Advanced topics in convex optimization," available online.
5. Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer, 2004.
6. Figures made with GeoGebra, a free tool for producing geometric figures, available for download online.

References
7. Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. "A singular value thresholding algorithm for matrix completion." SIAM Journal on Optimization 20.4 (2010).
8. Lin, Zhouchen, Minming Chen, and Yi Ma. "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices." arXiv preprint (2010).
9. Boyd, Stephen, et al. "Distributed optimization and statistical learning via the alternating direction method of multipliers." Foundations and Trends in Machine Learning 3.1 (2011).

