Newton’s Method applied to a scalar function Newton’s method for minimizing f(x): Twice differentiable function f(x), initial solution x 0. Generate a.


Newton’s Method applied to a scalar function
Newton’s method for minimizing $f(x)$: given a twice-differentiable function $f(x)$ and an initial solution $x_0$, generate a sequence of solutions $x_1, x_2, \ldots$ and stop when the sequence converges to a solution with $\nabla f(x) = 0$.
1. Solve $-\nabla f(x_k) \approx \nabla^2 f(x_k)\,\Delta x$ for $\Delta x$.
2. Let $x_{k+1} = x_k + \Delta x$.
3. Let $k = k + 1$.
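The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a robust optimizer: the function name `newton_minimize`, the tolerance, and the test function $f(x) = e^x - 2x$ (minimum at $x = \ln 2$) are assumptions chosen for the example.

```python
import math

def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    """Minimize a scalar function from its first (df) and second (d2f) derivatives."""
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:      # stop when f'(x) is numerically zero
            break
        dx = -g / d2f(x)      # step 1: solve -f'(x_k) = f''(x_k) * dx
        x = x + dx            # step 2: x_{k+1} = x_k + dx
    return x

# Hypothetical test function f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2, f''(x) = exp(x).
x_min = newton_minimize(lambda x: math.exp(x) - 2.0,
                        lambda x: math.exp(x),
                        x0=0.0)
print(x_min, math.log(2.0))
```

Because $f''(x) > 0$ everywhere for this test function, every Newton step is a descent step; for a general $f$ the iteration can converge to a maximum or a saddle point instead.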

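Newton’s Method applied to LS
Newton’s method is not directly applicable to most nonlinear regression and inverse problems (the number of model parameters does not equal the number of data points, and there is no exact solution to $G(\mathbf{m}) = \mathbf{d}$). Instead we use Newton’s method to minimize a nonlinear LS problem, e.g. fitting a vector of $n$ parameters $\mathbf{m}$ to a data vector $\mathbf{d}$ with $m$ elements:
$f(\mathbf{m}) = \sum_{i=1}^{m} \left[ (G(\mathbf{m})_i - d_i)/\sigma_i \right]^2$
Let $f_i(\mathbf{m}) = (G(\mathbf{m})_i - d_i)/\sigma_i$, $i = 1, 2, \ldots, m$, and $F(\mathbf{m}) = [f_1(\mathbf{m}) \; \ldots \; f_m(\mathbf{m})]^T$, so that
$f(\mathbf{m}) = \sum_{i=1}^{m} f_i(\mathbf{m})^2$ and $\nabla f(\mathbf{m}) = \sum_{i=1}^{m} \nabla \left[ f_i(\mathbf{m})^2 \right]$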

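NM: solve $-\nabla f(\mathbf{m}_k) \approx \nabla^2 f(\mathbf{m}_k)\,\Delta\mathbf{m}$, with $f_i(\mathbf{m}) = (G(\mathbf{m})_i - d_i)/\sigma_i$, $i = 1, 2, \ldots, m$, $F(\mathbf{m}) = [f_1(\mathbf{m}) \; \ldots \; f_m(\mathbf{m})]^T$, and $J(\mathbf{m})$ the Jacobian of $F(\mathbf{m})$.
LHS: $-\nabla f(\mathbf{m}_k)_j = -\sum_{i=1}^{m} 2 f_i(\mathbf{m}_k)\,\nabla f_i(\mathbf{m}_k)_j$, i.e. $-\nabla f(\mathbf{m}_k) = -2 J(\mathbf{m}_k)^T F(\mathbf{m}_k)$
RHS: $\nabla^2 f(\mathbf{m}_k)\,\Delta\mathbf{m} = [2 J(\mathbf{m}_k)^T J(\mathbf{m}_k) + Q(\mathbf{m}_k)]\,\Delta\mathbf{m} = H(\mathbf{m}_k)\,\Delta\mathbf{m}$, where $Q(\mathbf{m}) = 2 \sum_{i=1}^{m} f_i(\mathbf{m})\,\nabla^2 f_i(\mathbf{m})$ and $H(\mathbf{m}) = 2 J(\mathbf{m})^T J(\mathbf{m}) + Q(\mathbf{m})$
Hence $-2 J(\mathbf{m}_k)^T F(\mathbf{m}_k) = H(\mathbf{m}_k)\,\Delta\mathbf{m}$, so that
$\Delta\mathbf{m} = -H^{-1} \nabla f(\mathbf{m}_k) = -2 H^{-1} J(\mathbf{m}_k)^T F(\mathbf{m}_k)$ (eq. 9.19)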

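Gauss-Newton (GN) method
$\nabla^2 f(\mathbf{m}_k)\,\Delta\mathbf{m} = H(\mathbf{m}_k)\,\Delta\mathbf{m} = [2 J(\mathbf{m}_k)^T J(\mathbf{m}_k) + Q(\mathbf{m}_k)]\,\Delta\mathbf{m}$
GN ignores $Q(\mathbf{m}) = 2 \sum_{i=1}^{m} f_i(\mathbf{m})\,\nabla^2 f_i(\mathbf{m})$, so that $\nabla^2 f(\mathbf{m}) \approx 2 J(\mathbf{m})^T J(\mathbf{m})$, assuming the $f_i(\mathbf{m})$ will be reasonably small as we approach $\mathbf{m}^*$.
That is, solve $-\nabla f(\mathbf{m}_k) \approx \nabla^2 f(\mathbf{m}_k)\,\Delta\mathbf{m}$ with $\nabla f(\mathbf{m})_j = \sum_{i=1}^{m} 2 f_i(\mathbf{m})\,\nabla f_i(\mathbf{m})_j$, i.e.
$J(\mathbf{m}_k)^T J(\mathbf{m}_k)\,\Delta\mathbf{m} = -J(\mathbf{m}_k)^T F(\mathbf{m}_k)$
where $f_i(\mathbf{m}) = (G(\mathbf{m})_i - d_i)/\sigma_i$, $i = 1, 2, \ldots, m$, and $F(\mathbf{m}) = [f_1(\mathbf{m}) \; \ldots \; f_m(\mathbf{m})]^T$.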
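As a rough illustration of the GN normal equations $J^T J\,\Delta\mathbf{m} = -J^T F$, the sketch below fits a two-parameter exponential. The forward model $G(\mathbf{m})_i = m_0 e^{m_1 t_i}$, the noise-free synthetic data, the starting model, and all function names are assumptions made for this example; no step-length control is included, so a poor starting model may diverge.

```python
import numpy as np

def gauss_newton(F, J, m0, tol=1e-12, max_iter=50):
    """Plain Gauss-Newton iteration for min sum_i f_i(m)^2 (no line search)."""
    m = np.asarray(m0, dtype=float)
    for _ in range(max_iter):
        Fk, Jk = F(m), J(m)
        # GN step: J^T J dm = -J^T F
        dm = np.linalg.solve(Jk.T @ Jk, -Jk.T @ Fk)
        m = m + dm
        if np.linalg.norm(dm) < tol * (1.0 + np.linalg.norm(m)):
            break
    return m

# Hypothetical forward model G(m)_i = m[0] * exp(m[1] * t_i) with noise-free data.
t = np.linspace(0.0, 2.0, 8)
m_true = np.array([2.0, -1.0])
d = m_true[0] * np.exp(m_true[1] * t)
sigma = np.ones_like(t)

def F(m):
    # Scaled residuals f_i(m) = (G(m)_i - d_i) / sigma_i
    return (m[0] * np.exp(m[1] * t) - d) / sigma

def J(m):
    # Jacobian J[i, j] = d f_i / d m_j
    e = np.exp(m[1] * t)
    return np.column_stack([e, m[0] * t * e]) / sigma[:, None]

m_est = gauss_newton(F, J, m0=[1.5, -0.8])
print(m_est)
```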

Newton’s Method applied to LS
The Levenberg-Marquardt (LM) method uses
$[J(\mathbf{m}_k)^T J(\mathbf{m}_k) + \lambda I]\,\Delta\mathbf{m} = -J(\mathbf{m}_k)^T F(\mathbf{m}_k)$
$\lambda \to 0$: GN. $\lambda \to$ large: steepest descent (SD), which moves down-gradient most rapidly. SD provides slow but certain convergence.
Which value of $\lambda$ to use? Small values when GN is working well; switch to larger values in problem areas. Start with a small value of $\lambda$, then adjust.
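One simple way to "start small, then adjust" is to shrink the damping parameter after a successful step and grow it after a failed one. The sketch below does this with factors of 10; the factors, the initial value, the function names, and the exponential test problem are all assumptions for illustration rather than a prescribed strategy.

```python
import numpy as np

def levenberg_marquardt(F, J, m0, lam=1e-3, tol=1e-10, max_iter=200):
    """Basic LM iteration with a multiply/divide-by-10 damping update."""
    m = np.asarray(m0, dtype=float)
    f = np.sum(F(m) ** 2)
    for _ in range(max_iter):
        Fk, Jk = F(m), J(m)
        grad = Jk.T @ Fk
        if np.linalg.norm(grad) < tol:      # gradient ~ 0: stop
            break
        # LM step: (J^T J + lam I) dm = -J^T F
        dm = np.linalg.solve(Jk.T @ Jk + lam * np.eye(m.size), -grad)
        f_new = np.sum(F(m + dm) ** 2)
        if f_new < f:
            m, f = m + dm, f_new
            lam *= 0.1                       # GN-like step worked: less damping
        else:
            lam *= 10.0                      # step failed: lean toward steepest descent
    return m

# Hypothetical test problem: G(m)_i = m[0] * exp(m[1] * t_i), noise-free data.
t = np.linspace(0.0, 2.0, 8)
d = 2.0 * np.exp(-1.0 * t)

def F(m):
    return m[0] * np.exp(m[1] * t) - d

def J(m):
    e = np.exp(m[1] * t)
    return np.column_stack([e, m[0] * t * e])

m_est = levenberg_marquardt(F, J, m0=[1.0, 0.0])
print(m_est)
```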

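Statistics of iterative methods
For a linear problem, $\mathrm{Cov}(A\mathbf{d}) = A\,\mathrm{Cov}(\mathbf{d})\,A^T$ ($\mathbf{d}$ has a multivariate normal distribution), so
$\mathrm{Cov}(\mathbf{m}_{L_2}) = (G^T G)^{-1} G^T\,\mathrm{Cov}(\mathbf{d})\,G (G^T G)^{-1}$
and for $\mathrm{Cov}(\mathbf{d}) = \sigma^2 I$: $\mathrm{Cov}(\mathbf{m}_{L_2}) = \sigma^2 (G^T G)^{-1}$
However, in nonlinear regression we do not have a linear relationship between the data and the estimated model parameters, so these formulas cannot be used. Instead, linearize about $\mathbf{m}^*$:
$F(\mathbf{m}^* + \Delta\mathbf{m}) \approx F(\mathbf{m}^*) + J(\mathbf{m}^*)\,\Delta\mathbf{m}$
$\mathrm{Cov}(\mathbf{m}^*) \approx (J(\mathbf{m}^*)^T J(\mathbf{m}^*))^{-1}$
This is not exact because of the linearization, so confidence intervals may not be accurate.
With $r_i = G(\mathbf{m}^*)_i - d_i$ and $\sigma_i = 1$:
$s = \sqrt{\sum_{i=1}^{m} r_i^2 / (m - n)}$, $\mathrm{Cov}(\mathbf{m}^*) = s^2 (J(\mathbf{m}^*)^T J(\mathbf{m}^*))^{-1}$
Use this to establish confidence intervals ($\chi^2$).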

Implementation Issues
1. Explicit (analytical) expressions for derivatives
2. Finite difference approximation for derivatives
3. When to stop iterating? $\nabla f(\mathbf{m}) \approx 0$, $\|\mathbf{m}_{k+1} - \mathbf{m}_k\| \approx 0$, $|f(\mathbf{m}_{k+1}) - f(\mathbf{m}_k)| \approx 0$ (eqs 9.47-9.49)
4. Multistart method to optimize globally

Iterative Methods
SVD is impractical when the matrix has 1000’s of rows and columns, e.g. a 256 x 256 cell tomography model with 100,000 ray paths, where each ray hits < 1% of the cells: 50 Gb storage for the system matrix, U ~ 80 Gb, V ~ 35 Gb.
This is a waste of storage when the system matrix is sparse.
Iterative methods do not store the system matrix.
They provide an approximate solution, rather than the exact one obtained using Gaussian elimination.
Definition of an iterative method: from a starting point $x_0$, take steps $x_1, x_2, \ldots$ that hopefully converge toward the right solution $x$.

Iterative Methods
Kaczmarz’s algorithm: each of the $m$ rows of the system, $G_{i,\cdot}\,\mathbf{m} = d_i$, defines a hyperplane in the $n$-dimensional model space.
1) Project the initial solution $\mathbf{m}^{(0)}$ onto the hyperplane defined by the first row of $G$
2) Project the solution $\mathbf{m}^{(1)}$ onto the hyperplane defined by the second row of $G$
3) … until projections have been done onto all hyperplanes
4) Start a new cycle of projections, and repeat until convergence
If $G\mathbf{m} = \mathbf{d}$ has a unique solution, Kaczmarz’s algorithm will converge towards it.
If there are several solutions, it will converge toward the solution closest to $\mathbf{m}^{(0)}$.
If $\mathbf{m}^{(0)} = 0$, we obtain the minimum length solution.
If there is no exact solution, convergence fails, and the iterates bounce around near an approximate solution.
If the hyperplanes are nearly orthogonal, convergence is fast.
If the hyperplanes are nearly parallel, convergence is slow.
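The projection cycle described above can be sketched as follows. Each projection of the current model onto the hyperplane $G_{i,\cdot}\,\mathbf{m} = d_i$ is $\mathbf{m} \leftarrow \mathbf{m} - \frac{G_{i,\cdot}\mathbf{m} - d_i}{G_{i,\cdot} G_{i,\cdot}^T} G_{i,\cdot}^T$. The function name, the fixed number of cycles, and the small underdetermined test system are assumptions for illustration.

```python
import numpy as np

def kaczmarz(G, d, m0=None, n_cycles=200):
    """Cyclic Kaczmarz projections onto the row hyperplanes of G m = d."""
    G = np.asarray(G, dtype=float)
    d = np.asarray(d, dtype=float)
    m = np.zeros(G.shape[1]) if m0 is None else np.array(m0, dtype=float)
    for _ in range(n_cycles):
        for i in range(G.shape[0]):
            g = G[i]
            # Orthogonal projection of m onto the hyperplane g . m = d_i
            m = m - (g @ m - d[i]) / (g @ g) * g
    return m

# Hypothetical consistent, underdetermined system (2 equations, 3 unknowns).
G = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
d = np.array([2.0, 3.0])

m_kacz = kaczmarz(G, d)                # starts from m(0) = 0
m_min_norm = np.linalg.pinv(G) @ d     # minimum length solution for comparison
print(m_kacz, m_min_norm)
```

Starting from zero, the iterates stay in the row space of $G$, which is why the result matches the minimum length (pseudoinverse) solution in this example.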

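Conjugate Gradients Method
Symmetric, positive definite system of equations $Ax = b$:
$\min \phi(x) = \min \left( \tfrac{1}{2} x^T A x - b^T x \right)$, $\nabla\phi(x) = Ax - b = 0$, or $Ax = b$
CG method: construct a basis $p_0, p_1, \ldots, p_{n-1}$ such that $p_i^T A p_j = 0$ when $i \neq j$. Such a basis is mutually conjugate with respect to $A$.
Only walk once in each direction and minimize: $x = \sum_{i=0}^{n-1} \alpha_i p_i$, so a maximum of $n$ steps is required!
$\phi(\alpha) = \tfrac{1}{2} \left[ \sum_i \alpha_i p_i \right]^T A \left[ \sum_i \alpha_i p_i \right] - b^T \left[ \sum_i \alpha_i p_i \right] = \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, p_i^T A p_j - b^T \left[ \sum_i \alpha_i p_i \right]$ (summations from $0$ to $n-1$)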

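Conjugate Gradients Method
$p_i^T A p_j = 0$ when $i \neq j$. Such a basis is mutually conjugate with respect to $A$ (the $p_i$ are said to be ‘A orthogonal’), so the cross terms vanish:
$\phi(\alpha) = \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, p_i^T A p_j - b^T \left[ \sum_i \alpha_i p_i \right] = \tfrac{1}{2} \sum_i \alpha_i^2\, p_i^T A p_i - b^T \left[ \sum_i \alpha_i p_i \right] = \tfrac{1}{2} \sum_i \left( \alpha_i^2\, p_i^T A p_i - 2 \alpha_i\, b^T p_i \right)$
These are $n$ independent terms. Thus minimize $\phi(\alpha)$ by minimizing the $i$th term $\alpha_i^2\, p_i^T A p_i - 2 \alpha_i\, b^T p_i$: differentiate with respect to $\alpha_i$ and set the derivative to zero:
$\alpha_i = b^T p_i / p_i^T A p_i$
i.e., IF we have a mutually conjugate basis, it is easy to minimize $\phi(\alpha)$.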

Conjugate Gradients Method
CG constructs sequences of $x_i$, $r_i = b - Ax_i$, and $p_i$.
Start: $x_0 = 0$ (so that $r_0 = b$), $p_0 = r_0$, $\alpha_0 = r_0^T r_0 / p_0^T A p_0$.
Assume that at the $k$th iteration we have $x_0, x_1, \ldots, x_k$; $r_0, r_1, \ldots, r_k$; $p_0, p_1, \ldots, p_k$; $\alpha_0, \alpha_1, \ldots, \alpha_k$.
Assume the first $k+1$ basis vectors $p_i$ are mutually conjugate with respect to $A$, the first $k+1$ residuals $r_i$ are mutually orthogonal, and $r_i^T p_j = 0$ for $i > j$.
Let $x_{k+1} = x_k + \alpha_k p_k$ and $r_{k+1} = r_k - \alpha_k A p_k$, which updates the residual correctly, since:
$r_{k+1} = b - Ax_{k+1} = b - A(x_k + \alpha_k p_k) = (b - Ax_k) - \alpha_k A p_k = r_k - \alpha_k A p_k$

Conjugate Gradients Method
Let $x_{k+1} = x_k + \alpha_k p_k$ and $r_{k+1} = r_k - \alpha_k A p_k$
$\beta_{k+1} = r_{k+1}^T r_{k+1} / r_k^T r_k$
$p_{k+1} = r_{k+1} + \beta_{k+1} p_k$
$b^T p_k = r_k^T r_k$ (eq 6.33-6.38)
Now we need proof of the assumptions:
1) $r_{k+1}$ is orthogonal to $r_i$ for $i \le k$ (eq 6.39-6.43)
2) $r_{k+1}^T r_k = 0$ (eq 6.44-6.48)
3) $r_{k+1}$ is orthogonal to $p_i$ for $i \le k$ (eq 6.49-6.54)
4) $p_{k+1}^T A p_i = 0$ for $i \le k$ (eq 6.55-6.60)
5) $i = k$: $p_{k+1}^T A p_k = 0$, i.e. CG generates a mutually conjugate basis

Conjugate Gradients Method
We have thus shown that CG generates a sequence of mutually conjugate basis vectors. In theory, the method will find an exact solution in $n$ iterations.
Given a positive definite, symmetric system of equations $Ax = b$ and an initial solution $x_0$, let $\beta_0 = 0$, $p_{-1} = 0$, $r_0 = b - Ax_0$, $k = 0$.
1. If $k > 0$, let $\beta_k = r_k^T r_k / r_{k-1}^T r_{k-1}$
2. Let $p_k = r_k + \beta_k p_{k-1}$
3. Let $\alpha_k = r_k^T r_k / p_k^T A p_k$
4. Let $x_{k+1} = x_k + \alpha_k p_k$
5. Let $r_{k+1} = r_k - \alpha_k A p_k$
6. Let $k = k + 1$
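The six steps above translate almost line for line into NumPy. The sketch below is a minimal version under some assumptions: the function name, the residual tolerance, the cap of $n$ iterations, and the 2 x 2 test system are chosen for illustration only.

```python
import numpy as np

def conjugate_gradients(A, b, x0=None, tol=1e-12):
    """CG for a symmetric positive definite system A x = b (at most n steps)."""
    n = b.size
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x                  # r_0 = b - A x_0
    p = np.zeros(n)                # p_{-1} = 0
    rr_old = 1.0                   # placeholder; not used while beta_0 = 0
    for k in range(n):
        rr = r @ r
        if np.sqrt(rr) < tol:      # already converged
            break
        beta = rr / rr_old if k > 0 else 0.0   # step 1
        p = r + beta * p                       # step 2
        Ap = A @ p
        alpha = rr / (p @ Ap)                  # step 3
        x = x + alpha * p                      # step 4
        r = r - alpha * Ap                     # step 5
        rr_old = rr
    return x

# Hypothetical small SPD test system.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_cg = conjugate_gradients(A, b)
print(x_cg, np.linalg.solve(A, b))
```

Only matrix-vector products with $A$ are needed, which is what makes the method attractive for large sparse systems.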

Conjugate Gradients Least Squares Method
CG can only be applied to symmetric, positive definite systems of equations, and is thus not applicable to general LS problems. Instead, we can apply the CGLS method to $\min \|G\mathbf{m} - \mathbf{d}\|_2$, i.e. to the normal equations $G^T G\mathbf{m} = G^T \mathbf{d}$.

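Conjugate Gradients Least Squares Method
$G^T G\mathbf{m} = G^T \mathbf{d}$
$r_k = G^T \mathbf{d} - G^T G\mathbf{m}_k = G^T(\mathbf{d} - G\mathbf{m}_k) = G^T s_k$
$s_{k+1} = \mathbf{d} - G\mathbf{m}_{k+1} = \mathbf{d} - G(\mathbf{m}_k + \alpha_k p_k) = (\mathbf{d} - G\mathbf{m}_k) - \alpha_k G p_k = s_k - \alpha_k G p_k$
Given a system of equations $G\mathbf{m} = \mathbf{d}$, let $k = 0$, $\mathbf{m}_0 = 0$, $p_{-1} = 0$, $\beta_0 = 0$, $s_0 = \mathbf{d}$, $r_0 = G^T s_0$.
1. If $k > 0$, let $\beta_k = r_k^T r_k / [r_{k-1}^T r_{k-1}]$
2. Let $p_k = r_k + \beta_k p_{k-1}$
3. Let $\alpha_k = r_k^T r_k / [G p_k]^T [G p_k]$
4. Let $\mathbf{m}_{k+1} = \mathbf{m}_k + \alpha_k p_k$
5. Let $s_{k+1} = s_k - \alpha_k G p_k$
6. Let $r_{k+1} = G^T s_{k+1}$
7. Let $k = k + 1$
The algorithm never computes $G^T G$, only $G p_k$ and $G^T s_{k+1}$.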
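A minimal NumPy sketch of the CGLS iteration follows, using only products with $G$ and $G^T$ and recomputing the normal-equation residual as $r_{k+1} = G^T s_{k+1}$. The function name, the tolerance, the iteration cap, and the small straight-line fitting problem are assumptions for illustration.

```python
import numpy as np

def cgls(G, d, n_iter=None, tol=1e-12):
    """CGLS for min ||G m - d||_2 without ever forming G^T G."""
    n = G.shape[1]
    m = np.zeros(n)                    # m_0 = 0
    s = np.array(d, dtype=float)       # s_0 = d - G m_0 = d
    r = G.T @ s                        # r_0 = G^T s_0
    p = np.zeros(n)                    # p_{-1} = 0
    rr_old = 1.0                       # placeholder; not used while beta_0 = 0
    for k in range(n if n_iter is None else n_iter):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        beta = rr / rr_old if k > 0 else 0.0
        p = r + beta * p
        Gp = G @ p
        alpha = rr / (Gp @ Gp)
        m = m + alpha * p
        s = s - alpha * Gp             # data-space residual
        r = G.T @ s                    # normal-equation residual
        rr_old = rr
    return m

# Hypothetical overdetermined problem: fit a straight line to four points.
G = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
d = np.array([1.0, 2.2, 2.9, 4.1])
m_cgls = cgls(G, d)
m_ref = np.linalg.lstsq(G, d, rcond=None)[0]
print(m_cgls, m_ref)
```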

$L_1$ Regression
LS ($L_2$) is strongly affected by outliers. If outliers are due to incorrect measurements, the inversion should minimize their effect on the estimated model.
The sensitivity of LS to outliers follows from the rapid fall-off of the tails of the normal distribution. In contrast, the exponential distribution has a longer tail, implying that the probability of realizing data far from the mean is higher. A few data points several $\sigma$ from the mean are much more probable if drawn from an exponential rather than from a normal distribution.
Therefore methods based on exponential distributions are able to handle outliers better than methods based on normal distributions. Such methods are said to be robust.

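$L_1$ Regression
$\min \sum_i |d_i - (G\mathbf{m})_i| / \sigma_i = \min \|\mathbf{d}_w - G_w \mathbf{m}\|_1$
This is more robust to outliers because the error is not squared.
Example: repeating a measurement $m$ times (here $m$ is used both for the number of measurements and, as the argument of $f$, for the single model parameter):
$[1 \; 1 \; \ldots \; 1]^T m = [d_1 \; d_2 \; \ldots \; d_m]^T$
$m_{L_2} = (G^T G)^{-1} G^T \mathbf{d} = m^{-1} \sum_i d_i$ (the mean)
$f(m) = \|\mathbf{d} - G m\|_1 = \sum_i |d_i - m|$
$f$ is non-differentiable where $m = d_i$, but it is convex, so local minima = global minima.
$f'(m) = -\sum_i \mathrm{sgn}(d_i - m)$, where $\mathrm{sgn}(x) = +1$ if $x > 0$, $= -1$ if $x < 0$, $= 0$ if $x = 0$
$f'(m) = 0$ if half of the terms are $+$ and half are $-$, so the estimate is the median, where 1/2 of the data is < est and 1/2 > est.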
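A small numerical check of the repeated-measurement example, using made-up data with one gross outlier (the values are an assumption for illustration): the $L_2$ estimate is the mean and is pulled toward the outlier, while the $L_1$ estimate is the median.

```python
import statistics

# Hypothetical repeated measurements of one quantity; 10.0 is an outlier.
d = [1.0, 1.1, 0.9, 1.05, 10.0]

def l1_misfit(m):
    """f(m) = sum_i |d_i - m|, the 1-norm misfit for a single parameter."""
    return sum(abs(di - m) for di in d)

m_l2 = statistics.mean(d)      # least-squares estimate (mean)
m_l1 = statistics.median(d)    # 1-norm estimate (median)

print(m_l2, l1_misfit(m_l2))
print(m_l1, l1_misfit(m_l1))
```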

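$L_1$ Regression
Finding $\min \|\mathbf{d}_w - G_w \mathbf{m}\|_1$ is not trivial. Several methods are available, such as IRLS, which solves a series of LS problems converging to the 1-norm solution:
$\mathbf{r} = \mathbf{d} - G\mathbf{m}$
$f(\mathbf{m}) = \|\mathbf{d} - G\mathbf{m}\|_1 = \|\mathbf{r}\|_1 = \sum_i |r_i|$, which is non-differentiable where $r_i = 0$.
At other points: $\partial f(\mathbf{m}) / \partial m_k = -\sum_i G_{i,k}\,\mathrm{sgn}(r_i) = -\sum_i G_{i,k}\, r_i / |r_i|$
$\nabla f(\mathbf{m}) = -G^T R\,\mathbf{r} = -G^T R(\mathbf{d} - G\mathbf{m})$, where $R$ is diagonal with $R_{i,i} = 1/|r_i|$
Setting $\nabla f(\mathbf{m}) = -G^T R(\mathbf{d} - G\mathbf{m}) = 0$ gives $G^T R G\mathbf{m} = G^T R\,\mathbf{d}$
$R$ depends on $\mathbf{m}$, so this is a nonlinear system :( Solve it with IRLS!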
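A minimal IRLS sketch: freeze $R$ at the current residuals, solve the weighted normal equations $G^T R G\mathbf{m} = G^T R\,\mathbf{d}$, and repeat. The function name, the starting model (the $L_2$ solution), the guard `eps` against division by zero when a residual vanishes, the stopping rule, and the test data are all assumptions for this example.

```python
import numpy as np

def irls_l1(G, d, n_iter=50, eps=1e-8, tol=1e-10):
    """Iteratively reweighted least squares for min ||d - G m||_1."""
    m = np.linalg.lstsq(G, d, rcond=None)[0]       # start from the L2 solution
    for _ in range(n_iter):
        r = d - G @ m
        w = 1.0 / np.maximum(np.abs(r), eps)       # R_ii = 1/|r_i|, guarded near zero
        GtR = G.T * w                              # G^T R with R diagonal
        m_new = np.linalg.solve(GtR @ G, GtR @ d)  # G^T R G m = G^T R d
        converged = np.linalg.norm(m_new - m) < tol * (1.0 + np.linalg.norm(m_new))
        m = m_new
        if converged:
            break
    return m

# Hypothetical repeated measurements of one quantity; 10.0 is an outlier.
d = np.array([1.0, 1.1, 0.9, 1.05, 10.0])
G = np.ones((d.size, 1))
m_l1 = irls_l1(G, d)
print(m_l1, np.median(d))
```

For this single-parameter problem the 1-norm solution is the median of the data, which gives an independent check on the iteration.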

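convolution
$s(t) = h(t) * f(t) = \int h(t - k)\, f(k)\, dk = \sum_k h_{t-k} f_k$
Assuming $h(t)$ and $f(t)$ are of length 5 and 3, respectively:
$\begin{pmatrix} h_1 & 0 & 0 \\ h_2 & h_1 & 0 \\ h_3 & h_2 & h_1 \\ h_4 & h_3 & h_2 \\ h_5 & h_4 & h_3 \\ 0 & h_5 & h_4 \\ 0 & 0 & h_5 \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ f_3 \end{pmatrix} = \begin{pmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \\ s_6 \\ s_7 \end{pmatrix}$
Here, a recursive solution is easy.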
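The banded matrix form of convolution can be built column by column, each column being a shifted copy of $h$. In the sketch below the helper name `convolution_matrix` and the numerical values of $h$ and $f$ are assumptions for illustration; the result is checked against `numpy.convolve`.

```python
import numpy as np

def convolution_matrix(h, n_f):
    """Matrix H such that H @ f equals the full convolution of h with f (len(f) == n_f)."""
    h = np.asarray(h, dtype=float)
    H = np.zeros((h.size + n_f - 1, n_f))
    for j in range(n_f):
        H[j:j + h.size, j] = h     # column j is h shifted down by j samples
    return H

# Hypothetical series of length 5 and 3, as in the matrix layout above.
h = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
f = np.array([1.0, -1.0, 2.0])

H = convolution_matrix(h, f.size)
s = H @ f
print(H.shape)
print(s, np.convolve(h, f))
```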

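convolution
‘Shaping’ filtering: $A * x = D$, where $D$ is the ‘desired’ response and $A$, $D$ are known.
$A^T A = \begin{pmatrix} \cdots & a_1 & a_2 & a_3 & \cdots \\ \cdots & a_0 & a_1 & a_2 & \cdots \\ \cdots & a_{-1} & a_0 & a_1 & \cdots \end{pmatrix} \begin{pmatrix} \vdots & \vdots & \vdots \\ a_1 & a_0 & a_{-1} \\ a_2 & a_1 & a_0 \\ a_3 & a_2 & a_1 \\ \vdots & \vdots & \vdots \end{pmatrix} = \Phi$
The matrix $\Phi_{ij}$ is formed by the auto-correlation of $a_t$, with zero-lag values along the diagonal and auto-correlations of successively higher lags off the diagonal. $\Phi_{ij}$ is symmetric of order $n$.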

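convolution
$A^T D$ becomes
$\begin{pmatrix} \cdots & a_{-1} & a_{-2} & a_{-3} & \cdots \\ \cdots & a_0 & a_{-1} & a_{-2} & \cdots \\ \cdots & a_1 & a_0 & a_{-1} & \cdots \end{pmatrix} \begin{pmatrix} \vdots \\ d_{-1} \\ d_0 \\ d_1 \\ \vdots \end{pmatrix} = \begin{pmatrix} c_1 \\ c_0 \\ c_{-1} \end{pmatrix}$
The matrix $c_{ij}$ is formed by the cross-correlation of the elements of $A$ and $D$.
Solution: $(A^T A)^{-1} A^T D = \Phi^{-1} c$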

Example
Find a filter, 3 elements long, that convolved with $(2, 1)$ produces $(1, 0, 0, 0)$:
$(2, 1) * (f_1, f_2, f_3) = (1, 0, 0, 0)$
The matrix $c_{ij}$ is formed by the cross-correlation of the elements of $A$ and $D$.
Solution: $(A^T A)^{-1} A^T D = \Phi^{-1} c$
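The example can be worked numerically by writing the convolution with $(2, 1)$ as a 4 x 3 matrix $A$ and solving the normal equations. This is a sketch under the assumption that a least-squares fit is what is wanted, since four equations in three unknowns have no exact solution here; the variable names are chosen for illustration.

```python
import numpy as np

a = np.array([2.0, 1.0])                 # known input wavelet
D = np.array([1.0, 0.0, 0.0, 0.0])       # desired output
n_f = 3                                  # filter length

# Convolution matrix: column j is `a` shifted down by j samples.
A = np.zeros((a.size + n_f - 1, n_f))
for j in range(n_f):
    A[j:j + a.size, j] = a

Phi = A.T @ A                            # auto-correlation matrix of a
c = A.T @ D                              # cross-correlation of a with D
f = np.linalg.solve(Phi, c)              # least-squares shaping filter

print(Phi)
print(f)
print(np.convolve(a, f))                 # actual output, to compare with D
```

For these numbers the filter works out to $(42/85, -20/85, 8/85) \approx (0.494, -0.235, 0.094)$, and the actual output $(84/85, 2/85, -4/85, 8/85)$ is close to, but not exactly, the desired spike.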
