
Lecture 7 Advanced Topics in Least Squares

the multivariate Normal distribution for data, d:

p(d) = (2π)^(-N/2) |C_d|^(-1/2) exp{ -½ (d - d̄)^T C_d^(-1) (d - d̄) }

Let's assume that the expectation d̄ is given by a general linear model, d̄ = Gm, and that the covariance C_d is known (prior covariance).

Then we have a distribution p(d; m) with unknown parameters, m:

p(d; m) = (2π)^(-N/2) |C_d|^(-1/2) exp{ -½ (d - Gm)^T C_d^(-1) (d - Gm) }

We can now apply the principle of maximum likelihood to estimate the unknown parameters m.

Principle of Maximum Likelihood. Last lecture we stated this principle as: maximize L(m) = Σ_i ln p(d_i; m) with respect to m. But in this distribution the whole data vector d is being treated as a single quantity, so the principle becomes simply: maximize

L(m) = ln p(d; m),  with  p(d; m) = (2π)^(-N/2) |C_d|^(-1/2) exp{ -½ (d - Gm)^T C_d^(-1) (d - Gm) }

L(m) = ln p(d; m) = -½ N ln(2π) - ½ ln|C_d| - ½ (d - Gm)^T C_d^(-1) (d - Gm)

The first two terms do not contain m, so the principle of maximum likelihood is: maximize -½ (d - Gm)^T C_d^(-1) (d - Gm), or equivalently, minimize (d - Gm)^T C_d^(-1) (d - Gm).
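As a quick illustration (not part of the original lecture), here is a minimal MatLab sketch that evaluates this log-likelihood for a trial model on synthetic straight-line data; the variable names (G, dobs, Cd, mtrial) are assumptions of the example.

N = 10;
x = (1:N)';
G = [ones(N,1), x];                        % straight-line data kernel
Cd = 0.25*eye(N);                          % known prior covariance (uncorrelated here)
dobs = G*[1; 2] + 0.5*randn(N,1);          % synthetic observations
mtrial = [1; 2];                           % trial model
e = dobs - G*mtrial;                       % prediction error
L = -0.5*N*log(2*pi) - 0.5*log(det(Cd)) - 0.5*e'*(Cd\e)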

Special case of uncorrelated data with equal variance: C_d = σ_d^2 I. Then the rule is: minimize σ_d^(-2) (d - Gm)^T (d - Gm) with respect to m, which is the same as: minimize (d - Gm)^T (d - Gm) with respect to m. This is the Principle of Least Squares.

So: minimize E = e^T e = (d - Gm)^T (d - Gm) with respect to m follows from the Principle of Maximum Likelihood in the special case of a multivariate Normal distribution, with the data uncorrelated and of equal variance.
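A minimal MatLab sketch of this special case on hypothetical synthetic data (the names x, G, dobs, mest are made up for the example); the backslash operator solves the normal equations [G^T G] m = G^T d.

N = 101;
x = linspace(0,10,N)';
G = [ones(N,1), x];                        % straight line: d = a + b*x
dobs = G*[1; 2] + 2*randn(N,1);            % synthetic data with sigma_d = 2
mest = (G'*G)\(G'*dobs)                    % least-squares estimate of [a; b]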

Corollary: if your data are NOT NORMALLY DISTRIBUTED, then least-squares is not the right method to use!

What if C_d = σ_d^2 I but σ_d is unknown? Note that |C_d| = σ_d^(2N), so

L(m, σ_d) = -½ N ln(2π) - ½ ln|C_d| - ½ (d - Gm)^T C_d^(-1) (d - Gm)
          = -½ N ln(2π) - N ln(σ_d) - ½ σ_d^(-2) (d - Gm)^T (d - Gm)

The first two terms do not contain m, so the principle of maximum likelihood still implies: minimize (d - Gm)^T (d - Gm) = e^T e = E. Then

∂L/∂σ_d = 0 = -N σ_d^(-1) + σ_d^(-3) (d - Gm)^T (d - Gm)

or, solving for σ_d:

σ_d^2 = N^(-1) (d - Gm)^T (d - Gm) = N^(-1) e^T e

So the Principle of Maximum Likelihood implies that

σ_d^2 = N^(-1) (d - Gm)^T (d - Gm) = N^(-1) e^T e

is a good posterior estimate of the variance of the data, when the data follow a multivariate Normal distribution and are uncorrelated with uniform (but unknown) variance σ_d^2.
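Continuing the sketch above (same synthetic G, dobs, and mest), a brief illustration of this posterior variance estimate; with the true σ_d = 2, the estimate should come out near 4.

e = dobs - G*mest;                         % prediction error after the fit
sd2est = (e'*e)/N                          % posterior estimate of sigma_d^2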

But back to the general case: what formula for m does the rule "minimize (d - Gm)^T C_d^(-1) (d - Gm)" imply?

Trick: minimizing (d - Gm)^T (d - Gm) implies m = [G^T G]^(-1) G^T d. Now write

minimize (d - Gm)^T C_d^(-1) (d - Gm)
       = (d - Gm)^T C_d^(-1/2) C_d^(-1/2) (d - Gm)
       = (C_d^(-1/2) d - C_d^(-1/2) Gm)^T (C_d^(-1/2) d - C_d^(-1/2) Gm)
       = (d′ - G′m)^T (d′ - G′m)

with d′ = C_d^(-1/2) d and G′ = C_d^(-1/2) G. This is simple least squares, so

m = [G′^T G′]^(-1) G′^T d′ = [G^T C_d^(-1/2) C_d^(-1/2) G]^(-1) G^T C_d^(-1/2) C_d^(-1/2) d = [G^T C_d^(-1) G]^(-1) G^T C_d^(-1) d

(C_d is symmetric, so its inverse and square root are symmetric, too.)

So, minimizing (d - Gm)^T C_d^(-1) (d - Gm) implies

m = [G^T C_d^(-1) G]^(-1) G^T C_d^(-1) d

and, remembering the formula C_m = M C_d M^T,

C_m = {[G^T C_d^(-1) G]^(-1) G^T C_d^(-1)} C_d {[G^T C_d^(-1) G]^(-1) G^T C_d^(-1)}^T
    = [G^T C_d^(-1) G]^(-1) G^T C_d^(-1) G [G^T C_d^(-1) G]^(-1)
    = [G^T C_d^(-1) G]^(-1)
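A minimal MatLab sketch of both routes to the same estimate (the C_d^(-1/2) transformation and the direct formula), using an assumed exponential covariance like the one in the example below; the variable names are illustrative only.

N = 50;  c = 0.25;  sd = 1;
[XX, YY] = meshgrid(1:N, 1:N);
Cd = (sd^2)*exp(-c*abs(XX-YY));            % correlated data covariance
x = (1:N)';  G = [ones(N,1), x];
dobs = G*[1; 2] + chol(Cd,'lower')*randn(N,1);   % synthetic correlated noise
W = inv(sqrtm(Cd));                        % Cd^(-1/2), symmetric
mest1 = ((W*G)'*(W*G)) \ ((W*G)'*(W*dobs));      % simple least squares on d', G'
Cdi = inv(Cd);
mest2 = (G'*Cdi*G) \ (G'*Cdi*dobs);        % direct formula; same result as mest1
Cm = inv(G'*Cdi*G);                        % covariance of the estimate
sm = sqrt(diag(Cm))                        % standard errors of intercept and slope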

Example with correlated noise. [Figure: time series of uncorrelated noise and of correlated noise]

[Figure: scatter plots of the correlated noise. d_i vs. d_(i+1): high correlation; d_i vs. d_(i+2): some correlation; d_i vs. d_(i+3): little correlation]

[Figure: data = straight line + correlated noise, d = a + bx + n, plotted against x]

Model for C_d: [C_d]_ij = σ_d^2 exp{ -c |i - j| } with c = 0.25, an exponential falloff away from the main diagonal.

MatLab code:

c = 0.25;
[XX, YY] = meshgrid( [1:N], [1:N] );       % matrices of row and column indices
Cd = (sd^2)*exp(-c*abs(XX-YY));            % exponential falloff from the main diagonal

Results for d = a + bx + n, plotted against x. Both fits are about the same, but:

Intercept:  correlated ± 20.6;  uncorrelated 8.42 ± 7.9;  true 1.0
Slope:      correlated 1.92 ± 0.35;  uncorrelated 1.97 ± 0.14;  true 2.0

Note that the error estimates are larger (more realistic?) for the correlated case.

How to make correlated noise: correlated noise is a weighted average of neighboring uncorrelated noise values.

w = [0.1, 0.3, 0.7, 1.0, 0.7, 0.3, 0.1]';  % define the weighting function
w = w/sum(w);
Nw = length(w); Nw2 = (Nw-1)/2;
N = 101; N2 = (N-1)/2;
n1 = random('Normal',0,1,N+Nw,1);          % start with uncorrelated noise
n = zeros(N,1);
for i = [-Nw2:Nw2]                         % weighted average of neighboring values
    n = n + w(i+Nw2+1)*n1(i+Nw-Nw2:i+Nw+N-1-Nw2);
end

Let's look at the transformations d′ = C_d^(-1/2) d and G′ = C_d^(-1/2) G in the special case of uncorrelated data with different variances,

C_d = diag( σ_1^2, σ_2^2, …, σ_N^2 )

Then d_i′ = σ_i^(-1) d_i (multiply each datum by the reciprocal of its error) and G_ij′ = σ_i^(-1) G_ij (multiply each row of the data kernel by the same amount), and solve by ordinary least squares.

In matrix form, the weighted system is

[ σ_1^(-1)G_11  σ_1^(-1)G_12  σ_1^(-1)G_13  …
  σ_2^(-1)G_21  σ_2^(-1)G_22  σ_2^(-1)G_23  …
  σ_3^(-1)G_31  σ_3^(-1)G_32  σ_3^(-1)G_33  …
  …
  σ_N^(-1)G_N1  σ_N^(-1)G_N2  σ_N^(-1)G_N3  … ]  m  =  [ σ_1^(-1)d_1;  σ_2^(-1)d_2;  σ_3^(-1)d_3;  …;  σ_N^(-1)d_N ]

that is, the rows have been weighted by a factor of σ_i^(-1).

So this special case is often called Weighted Least Squares. Note that the total error is

E = e^T C_d^(-1) e = Σ_i σ_i^(-2) e_i^2

Each individual error is weighted by the reciprocal of its variance, so errors involving data with SMALL variance get MORE weight.
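A minimal MatLab sketch of weighted least squares by this row scaling, on made-up data with two noise levels (all names hypothetical); it gives the same result as using C_d^(-1) = diag(σ_i^(-2)) directly, as in the MatLab example that follows.

N = 100;
x = linspace(0,10,N)';
sd = [5*ones(N/2,1); 100*ones(N/2,1)];     % first half precise, second half noisy
G = [ones(N,1), x];
dobs = G*[1; 2] + sd.*randn(N,1);          % synthetic data with unequal variances
W = diag(1./sd);                           % weight each row by 1/sigma_i
mest = ((W*G)'*(W*G)) \ ((W*G)'*(W*dobs))  % ordinary least squares on the weighted system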

Example: fitting a straight line to 100 data, where the first 50 have a different σ_d than the last 50.

MatLab code (note that C_d^(-1) is explicitly defined as a diagonal matrix):

N = 101; N2 = (N-1)/2;
sd(1:N2-1) = 5;                            % first part of the data: sigma_d = 5
sd(N2:N) = 100;                            % rest of the data: sigma_d = 100
sd2i = sd.^(-2);
Cdi = diag(sd2i);                          % Cd^(-1), diagonal
G(:,1) = ones(N,1);
G(:,2) = x;
GTCdiGI = inv(G'*Cdi*G);
m = GTCdiGI*G'*Cdi*d;                      % m = [G' Cd^(-1) G]^(-1) G' Cd^(-1) d
d2 = m(1) + m(2).*x;                       % predicted data

[Figure: fit with equal variance; left 50: σ_d = 5, right 50: σ_d = 5]

[Figure: the left side has the smaller variance; first 50: σ_d = 5, last 50: σ_d = 100]

[Figure: the right side has the smaller variance; first 50: σ_d = 100, last 50: σ_d = 5]

Finally, two miscellaneous comments about least-squares

Comment 1: the case of fitting functions to a dataset,

d_i = m_1 f_1(x_i) + m_2 f_2(x_i) + m_3 f_3(x_i) + …

e.g. d_i = m_1 sin(x_i) + m_2 cos(x_i) + m_3 sin(2x_i) + …

In matrix form,

[ f_1(x_1)  f_2(x_1)  f_3(x_1)  …
  f_1(x_2)  f_2(x_2)  f_3(x_2)  …
  f_1(x_3)  f_2(x_3)  f_3(x_3)  …
  …
  f_1(x_N)  f_2(x_N)  f_3(x_N)  … ]  m  =  [ d_1;  d_2;  d_3;  …;  d_N ]

Note that the matrix G^T G has elements

[G^T G]_ij = Σ_k f_i(x_k) f_j(x_k) = f_i · f_j

and thus is diagonal if the functions are orthogonal.

If the functions are also normalized so that f_i · f_i = 1, then G^T G = I and the least-squares solution is

m = G^T d,  i.e.  m_i = f_i · d,  with  C_m = σ_d^2 I

a super-simple formula, with guaranteed uncorrelated errors!
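A minimal MatLab sketch (an assumption of this write-up, not from the lecture) using a small sine/cosine basis scaled so that f_i · f_i = 1; then G^T G is the identity and m = G^T d recovers the coefficients without any matrix inversion.

N = 128;
x = (0:N-1)' * 2*pi/N;                     % evenly spaced points over one period
G = sqrt(2/N) * [cos(x), sin(x), cos(2*x), sin(2*x)];   % orthonormal columns
disp(G'*G)                                 % the identity matrix, to machine precision
dobs = 3*G(:,1) - 1*G(:,4) + 0.1*randn(N,1);
mest = G'*dobs                             % least-squares solution: about [3; 0; 0; -1]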

Example: straight line. Fitting y_i = a + b x_i implies f_1(x) = 1 and f_2(x) = x, so the condition f_1(x) · f_2(x) = 0 implies Σ_i x_i = 0, i.e. x̄ = 0; this happens when the x's straddle the origin. The choice f_1(x) = 1 and f_2(x) = x - x̄, i.e. y = a′ + b′(x - x̄), leads to uncorrelated errors in (a′, b′). [Figure: straight-line fit to points (x_1, …, x_5), showing the intercepts a and a′]
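A minimal MatLab sketch of this point (synthetic data, hypothetical names): when the x's do not straddle the origin, the covariance of (a, b) has large off-diagonal terms, while the centered parameterization (a′, b′) gives a diagonal covariance.

N = 50;  sd = 1;
x = linspace(5,15,N)';                     % x's do not straddle the origin
dobs = 1 + 2*x + sd*randn(N,1);
G1 = [ones(N,1), x];                       % fit y = a + b*x
G2 = [ones(N,1), x - mean(x)];             % fit y = a' + b'*(x - xbar)
Cm1 = sd^2 * inv(G1'*G1)                   % off-diagonal terms: correlated (a, b)
Cm2 = sd^2 * inv(G2'*G2)                   % diagonal: uncorrelated (a', b')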

Example: wavelet functions. [Figure: a set of wavelet basis functions, each a localized oscillation with a characteristic frequency]

[Figure: the matrix G^T G for the wavelet basis is "almost" diagonal]

Comment 2: sometimes writing least-squares as

[G^T G] m = G^T d   or   G^T [G m] = G^T d

is more useful than m = [G^T G]^(-1) G^T d, since you can use some method other than a matrix inverse for solving the equation.
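For instance, a hedged MatLab sketch (illustrative names only): the backslash operator solves [G^T G] m = G^T d through a factorization, and pcg is one iterative alternative that only needs matrix-vector products, since G^T G is symmetric and positive definite.

N = 200;
x = linspace(0,1,N)';
G = [ones(N,1), x, x.^2];                  % quadratic fit
dobs = G*[1; 2; 3] + 0.1*randn(N,1);
m1 = (G'*G) \ (G'*dobs)                    % solve the normal equations via backslash
m2 = pcg(@(v) G'*(G*v), G'*dobs, 1e-10, 200)   % iterative solve; never forms inv(G'*G)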