
1 Lecture 7 Advanced Topics in Least Squares

2 The multivariate normal distribution for data d:
p(d) = (2π)^(-N/2) |C_d|^(-1/2) exp{ -½ (d - d̄)^T C_d^(-1) (d - d̄) }
Let's assume that the expectation d̄ is given by a general linear model, d̄ = Gm, and that the covariance C_d is known (prior covariance).

3 Then we have a distribution p(d; m) with unknown parameters m:
p(d; m) = (2π)^(-N/2) |C_d|^(-1/2) exp{ -½ (d - Gm)^T C_d^(-1) (d - Gm) }
We can now apply the principle of maximum likelihood to estimate the unknown parameters m.

4 Principle of Maximum Likelihood
Last lecture we stated this principle as: maximize L(m) = Σ_i ln p(d_i; m) with respect to m. But in this distribution the whole data vector d is being treated as a single quantity, so the principle becomes simply: maximize
L(m) = ln p(d; m)
with p(d; m) = (2π)^(-N/2) |C_d|^(-1/2) exp{ -½ (d - Gm)^T C_d^(-1) (d - Gm) }

5 L(m) = ln p(d; m) = -½ N ln(2π) - ½ ln|C_d| - ½ (d - Gm)^T C_d^(-1) (d - Gm)
The first two terms do not contain m, so the principle of maximum likelihood is:
maximize -½ (d - Gm)^T C_d^(-1) (d - Gm), or equivalently
minimize (d - Gm)^T C_d^(-1) (d - Gm)

6 Special case of uncorrelated data with equal variance: C_d = σ_d² I
Minimize σ_d^(-2) (d - Gm)^T (d - Gm) with respect to m, which is the same as
minimize (d - Gm)^T (d - Gm) with respect to m.
This is the Principle of Least Squares.

7 Minimize E = e^T e = (d - Gm)^T (d - Gm) with respect to m follows from the Principle of Maximum Likelihood in the special case of a multivariate normal distribution with the data uncorrelated and of equal variance.

8 Corollary: if your data are NOT NORMALLY DISTRIBUTED, then least squares is not the right method to use!

9 What if C_d = σ_d² I but σ_d is unknown? Note that |C_d| = σ_d^(2N), so
L(m, σ_d) = -½ N ln(2π) - ½ ln|C_d| - ½ (d - Gm)^T C_d^(-1) (d - Gm)
= -½ N ln(2π) - N ln(σ_d) - ½ σ_d^(-2) (d - Gm)^T (d - Gm)
The first two terms do not contain m, so the principle of maximum likelihood still implies:
minimize (d - Gm)^T (d - Gm) = e^T e = E
Then ∂L/∂σ_d = 0 = -N σ_d^(-1) + σ_d^(-3) (d - Gm)^T (d - Gm), or, solving for σ_d,
σ_d² = N^(-1) (d - Gm)^T (d - Gm) = N^(-1) e^T e

10 Thus the Principle of Maximum Likelihood implies that
σ_d² = N^(-1) (d - Gm)^T (d - Gm) = N^(-1) e^T e
is a good posterior estimate of the variance of the data, when the data follow a multivariate normal distribution and are uncorrelated with uniform (but unknown) variance σ_d².
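A minimal MatLab sketch of this posthoc variance estimate (not on the original slides); it assumes the data kernel G, data vector d, and least-squares solution m are already in the workspace:
e = d - G*m;          % residuals of the least-squares fit
N = length(d);
sd2 = (e'*e)/N;       % maximum-likelihood (posterior) estimate of the data variance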

11 But back to the general case … What formula for m does the rule "minimize (d - Gm)^T C_d^(-1) (d - Gm)" imply?

12 Trick … Minimize (d - Gm)^T (d - Gm) implies m = [G^T G]^(-1) G^T d.
Now write
minimize (d - Gm)^T C_d^(-1) (d - Gm)
= (d - Gm)^T C_d^(-1/2) C_d^(-1/2) (d - Gm)
= (C_d^(-1/2) d - C_d^(-1/2) Gm)^T (C_d^(-1/2) d - C_d^(-1/2) Gm)
= (d' - G'm)^T (d' - G'm)
with d' = C_d^(-1/2) d and G' = C_d^(-1/2) G.
This is simple least squares, so m = [G'^T G']^(-1) G'^T d', or
m = [G^T C_d^(-1/2) C_d^(-1/2) G]^(-1) G^T C_d^(-1/2) C_d^(-1/2) d = [G^T C_d^(-1) G]^(-1) G^T C_d^(-1) d
(C_d is symmetric, so its inverse and square root are symmetric, too.)

13 So, minimize (d - Gm)^T C_d^(-1) (d - Gm) implies
m = [G^T C_d^(-1) G]^(-1) G^T C_d^(-1) d
and
C_m = {[G^T C_d^(-1) G]^(-1) G^T C_d^(-1)} C_d {[G^T C_d^(-1) G]^(-1) G^T C_d^(-1)}^T
= [G^T C_d^(-1) G]^(-1) G^T C_d^(-1) G [G^T C_d^(-1) G]^(-1)
= [G^T C_d^(-1) G]^(-1)
(Remember the formula C_m = M C_d M^T.)
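A MatLab sketch of these two formulas (not on the original slides), assuming G, d, and the prior data covariance Cd are already defined:
Cdi = inv(Cd);                 % Cd^-1
GTCdiGI = inv(G'*Cdi*G);       % [G' Cd^-1 G]^-1
m  = GTCdiGI*G'*Cdi*d;         % weighted least-squares estimate of the model
Cm = GTCdiGI;                  % posterior covariance of the model parameters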

14 Example with Correlated Noise [figure: panels showing uncorrelated noise and correlated noise]

15 Scatter Plots [figure]: d_i vs. d_(i+1) shows high correlation; d_i vs. d_(i+2) some correlation; d_i vs. d_(i+3) little correlation.

16 data = straight line + correlated noise: d = a + bx + n [figure: d vs. x]

17 Model for C_d: [C_d]_ij = σ_d² exp{ -c |i-j| } with c = 0.25, an exponential falloff from the main diagonal.
MatLab Code:
c = 0.25;
[XX, YY] = meshgrid( [1:N], [1:N] );
Cd = (sd^2)*exp(-c*abs(XX-YY));
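One possible sketch (not on the original slides) of how this Cd could be combined with the slide 13 formula to produce the correlated and uncorrelated fits compared on the next slide; it assumes x, d, N, and the Cd built above are in the workspace:
G = [ones(N,1), x];            % data kernel for the straight line d = a + b*x
mu = inv(G'*G)*G'*d;           % fit that ignores the correlation (ordinary least squares)
Cdi = inv(Cd);
mc = inv(G'*Cdi*G)*G'*Cdi*d;   % fit that accounts for the correlation (slide 13 formula)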

18 Results, d = a + bx + n. Both fits are about the same, but:
Intercept: correlated 10.96 ± 20.6, uncorrelated 8.42 ± 7.9, true 1.0
Slope: correlated 1.92 ± 0.35, uncorrelated 1.97 ± 0.14, true 2.0
… note the error estimates are larger (more realistic?) for the correlated case.

19 How to make correlated noise
MatLab Code:
% define weighting function
w = [0.1, 0.3, 0.7, 1.0, 0.7, 0.3, 0.1]'; w = w/sum(w);
Nw = length(w); Nw2 = (Nw-1)/2;
N = 101; N2 = (N-1)/2;
% start with uncorrelated noise
n1 = random('Normal',0,1,N+Nw,1);
% correlated noise is a weighted average of neighboring uncorrelated noise values
n = zeros(N,1);
for i = [-Nw2:Nw2]
    n = n + w(i+Nw2+1)*n1(i+Nw-Nw2:i+Nw+N-1-Nw2);
end

20 Let's look at the transformations d' = C_d^(-1/2) d and G' = C_d^(-1/2) G.
In the special case of uncorrelated data with different variances, C_d = diag(σ_1², σ_2², …, σ_N²):
d_i' = σ_i^(-1) d_i (multiply each datum by the reciprocal of its error)
G_ij' = σ_i^(-1) G_ij (multiply each row of the data kernel by the same amount)
Then solve by ordinary least squares.
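A minimal MatLab sketch of this row weighting (not on the original slides), assuming G, d, and a column vector sd of the data standard deviations σ_i:
dp = d ./ sd;                        % d_i' = d_i / sigma_i
Gp = G ./ repmat(sd, 1, size(G,2));  % divide each row of G by sigma_i
m  = inv(Gp'*Gp)*Gp'*dp;             % then solve by ordinary least squares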

21
[ σ_1^(-1) G_11   σ_1^(-1) G_12   σ_1^(-1) G_13   … ]          [ σ_1^(-1) d_1 ]
[ σ_2^(-1) G_21   σ_2^(-1) G_22   σ_2^(-1) G_23   … ]          [ σ_2^(-1) d_2 ]
[ σ_3^(-1) G_31   σ_3^(-1) G_32   σ_3^(-1) G_33   … ]   m   =  [ σ_3^(-1) d_3 ]
[ …                                                 ]          [ …            ]
[ σ_N^(-1) G_N1   σ_N^(-1) G_N2   σ_N^(-1) G_N3   … ]          [ σ_N^(-1) d_N ]
Rows have been weighted by a factor of σ_i^(-1).

22 So this special case is often called Weighted Least Squares. Note that the total error is
E = e^T C_d^(-1) e = Σ_i σ_i^(-2) e_i²
Each individual error is weighted by the reciprocal of its variance, so errors involving data with SMALL variance get MORE weight.

23 Example: fitting a straight line. 100 data; the first 50 have a different σ_d than the last 50.

24 MatLab Code:
N = 101; N2 = (N-1)/2;
sd(1:N2-1) = 5;
sd(N2:N) = 100;
sd2i = sd.^(-2);
Cdi = diag(sd2i);
G(:,1) = ones(N,1);
G(:,2) = x;
GTCdiGI = inv(G'*Cdi*G);
m = GTCdiGI*G'*Cdi*d;
d2 = m(1) + m(2).*x;
Note that C_d^(-1) is explicitly defined as a diagonal matrix.

25 Equal variance [figure]: left 50, σ_d = 5; right 50, σ_d = 5.

26 Left has smaller variance [figure]: first 50, σ_d = 5; last 50, σ_d = 100.

27 Right has smaller variance [figure]: first 50, σ_d = 100; last 50, σ_d = 5.

28 Finally, two miscellaneous comments about least-squares

29 Comment 1: case of fitting functions to a dataset
d_i = m_1 f_1(x_i) + m_2 f_2(x_i) + m_3 f_3(x_i) + …
e.g. d_i = m_1 sin(x_i) + m_2 cos(x_i) + m_3 sin(2x_i) + …

30
[ f_1(x_1)  f_2(x_1)  f_3(x_1)  … ]         [ d_1 ]
[ f_1(x_2)  f_2(x_2)  f_3(x_2)  … ]         [ d_2 ]
[ f_1(x_3)  f_2(x_3)  f_3(x_3)  … ]   m  =  [ d_3 ]
[ …                               ]         [ …   ]
[ f_1(x_N)  f_2(x_N)  f_3(x_N)  … ]         [ d_N ]

31 Note that the matrix G^T G has elements
[G^T G]_ij = Σ_k f_i(x_k) f_j(x_k) = f_i · f_j
and thus is diagonal if the functions are orthogonal.

32 If the functions are also normalized so that f_i · f_i = 1, then G^T G = I and the least-squares solution is
m = G^T d, i.e. m_i = f_i · d, with C_m = σ_d² I
A super-simple formula, with guaranteed uncorrelated errors!
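A MatLab sketch of this situation (not on the original slides), assuming equally spaced x over one full period so that the sine/cosine columns are exactly orthogonal, and normalizing so that f_i · f_i = 1:
N = 100;
x = 2*pi*(0:N-1)'/N;               % one full period, equally spaced
G = [sin(x), cos(x), sin(2*x)];    % orthogonal columns, each with f'*f = N/2
G = G / sqrt(N/2);                 % normalize so that G'*G = I
mtrue = [2; -1; 0.5];              % hypothetical true coefficients
d = G*mtrue + 0.1*random('Normal',0,1,N,1);
m = G'*d;                          % the super-simple formula m = G'd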

33 Example of a straight line: y_i = a + b x_i implies f_1(x) = 1 and f_2(x) = x, so the condition f_1(x) · f_2(x) = 0 implies Σ_i x_i = 0, i.e. x̄ = 0; this happens when the x's straddle the origin. The choice f_1(x) = 1 and f_2(x) = x - x̄, i.e. y = a' + b'(x - x̄), leads to uncorrelated errors in (a', b').
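A MatLab sketch of the demeaned straight-line fit (not on the original slides), assuming column vectors x and d and a data variance estimate sd2 are in the workspace:
N = length(x);
xbar = mean(x);
G = [ones(N,1), x - xbar];     % f1 = 1, f2 = x - xbar, so f1 . f2 = 0
m  = inv(G'*G)*G'*d;           % a' = m(1), b' = m(2)
Cm = sd2*inv(G'*G);            % diagonal, so the errors in (a',b') are uncorrelated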

34 Example – wavelet functions: localized oscillations with a characteristic frequency [figure].

35 G^T G is "almost" diagonal [figure: image of the G^T G matrix].

36 Comment 2: sometimes writing least squares as
[G^T G] m = G^T d or G^T [G m] = G^T d
is more useful than
m = [G^T G]^(-1) G^T d
since you can use some method other than a matrix inverse for solving the equation.
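For example, MatLab's backslash operator solves the normal equations without forming a matrix inverse (a sketch; G and d assumed defined):
m = (G'*G) \ (G'*d);           % solve [G'G] m = G'd by Gaussian elimination
% or let MatLab solve the least-squares problem itself (QR factorization):
m = G \ d;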

