Suppose that we have two variables:
1. Y – the dependent variable (response variable)
2. X – the independent variable (explanatory variable, factor)
X, the independent variable, may or may not be a random variable. Sometimes it is randomly observed; sometimes specific values of X are selected.
The dependent variable, Y, is assumed to be a random variable. The distribution of Y depends on X. The object is to determine that distribution using statistical techniques (estimation and hypothesis testing).
These decisions will be based on data collected on both the dependent variable Y and the independent variable X. Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ denote $n$ pairs of values measured on the independent variable (X) and the dependent variable (Y). The scatterplot is the graphical plot of the points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
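As a concrete illustration, here is a minimal sketch of how such a scatterplot can be drawn with numpy and matplotlib; the data values, and the use of matplotlib itself, are illustrative assumptions rather than anything from the notes.

```python
# Minimal scatterplot sketch; the (x_i, y_i) values below are made up.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # measurements on X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # measurements on Y

plt.scatter(x, y)
plt.xlabel("X (independent variable)")
plt.ylabel("Y (dependent variable)")
plt.title("Scatterplot of the pairs (x_i, y_i)")
plt.show()
```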
Assume that we have collected data on two variables X and Y. Let $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_n, y_n)$ denote the pairs of measurements on the two variables X and Y for $n$ cases in a sample (or population).
The assumption will be made that $y_1, y_2, y_3, \dots, y_n$ are:
1. independent random variables;
2. normally distributed;
3. have the common variance $\sigma^2$;
4. the mean of $y_i$ is $\mu_i = \alpha + \beta x_i$.
Data that satisfies the assumptions above is said to come from the Simple Linear Model.
Each $y_i$ is assumed to be randomly generated from a normal distribution with mean $\mu_i = \alpha + \beta x_i$ and standard deviation $\sigma$.
[Figure: normal densities of $y_i$, centered at $\alpha + \beta x_i$, above each value $x_i$.]
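A short simulation sketch of this generating mechanism, assuming illustrative parameter values $\alpha = 1$, $\beta = 2$, $\sigma = 0.5$ and equally spaced x values:

```python
# Simulating from the Simple Linear Model: y_i ~ N(alpha + beta*x_i, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 1.0, 2.0, 0.5    # assumed illustrative parameters
x = np.linspace(0.0, 10.0, 25)        # chosen values of the independent variable

# Each y_i is drawn independently from a normal distribution
# with mean alpha + beta * x_i and standard deviation sigma.
y = rng.normal(loc=alpha + beta * x, scale=sigma)
```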
When the data are correlated, the points fall roughly about a straight line.
The density of $y_i$ is:
$$f(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2}\right)$$
The joint density of $y_1, y_2, \dots, y_n$ is:
$$f(y_1, \dots, y_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2}\right) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2\right)$$
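In computations it is usually the logarithm of this joint density (the log-likelihood) that is evaluated; a sketch of it as a Python function follows, where the function and argument names are illustrative choices.

```python
import numpy as np

def log_likelihood(alpha, beta, sigma, x, y):
    """Log of the joint normal density of y_1, ..., y_n given above."""
    n = len(y)
    resid = y - (alpha + beta * x)
    return (-0.5 * n * np.log(2.0 * np.pi)
            - n * np.log(sigma)
            - np.sum(resid**2) / (2.0 * sigma**2))
```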
Estimation of the parameters: $\alpha$, the intercept; $\beta$, the slope; and $\sigma$, the standard deviation (or the variance $\sigma^2$).
The Least Squares Line: fitting the best straight line to “linear” data.
Let $Y = a + bX$ denote an arbitrary equation of a straight line, where $a$ and $b$ are known values. This equation can be used to predict, for each value of X, the value of Y. For example, if $X = x_i$ (as for the $i$th case), then the predicted value of Y is $\hat{y}_i = a + b x_i$.
Define the residual for each case in the sample to be $r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$. The residual sum of squares (RSS) is defined as $\mathrm{RSS} = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$. The residual sum of squares (RSS) is a measure of the “goodness of fit” of the line $Y = a + bX$ to the data.
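A direct translation of these two definitions into a small function (the name `rss` is a hypothetical choice):

```python
import numpy as np

def rss(a, b, x, y):
    """Residual sum of squares of the line Y = a + b*X over the data."""
    residuals = y - (a + b * x)      # r_i = y_i - (a + b*x_i)
    return np.sum(residuals**2)
```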
One choice of $a$ and $b$ will result in the residual sum of squares attaining a minimum. If this is the case, then the line $Y = a + bX$ is called the Least Squares Line.
To find the Least Squares estimates, $a$ and $b$, we need to solve the normal equations:
$$\frac{\partial\,\mathrm{RSS}}{\partial a} = -2\sum_{i=1}^{n}(y_i - a - b x_i) = 0 \quad\text{and}\quad \frac{\partial\,\mathrm{RSS}}{\partial b} = -2\sum_{i=1}^{n} x_i\,(y_i - a - b x_i) = 0$$
Solving these gives $b = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\,\bar{x}$.
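These closed-form expressions translate directly into code; the sketch below assumes numpy arrays x and y of equal length.

```python
import numpy as np

def least_squares_line(x, y):
    """Solve the two normal equations for the least squares a and b."""
    x_bar, y_bar = x.mean(), y.mean()
    s_xy = np.sum((x - x_bar) * (y - y_bar))
    s_xx = np.sum((x - x_bar)**2)
    b = s_xy / s_xx            # slope estimate
    a = y_bar - b * x_bar      # intercept estimate
    return a, b
```

In practice one would typically call an existing routine such as np.polyfit(x, y, 1), which returns the same slope and intercept.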
It can also be shown that $E[\hat{\sigma}^2] = \frac{n-2}{n}\sigma^2$. Thus the maximum likelihood estimator of $\sigma^2$, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\alpha} - \hat{\beta} x_i)^2$, is a biased estimator of $\sigma^2$. This estimator can easily be converted into an unbiased estimator of $\sigma^2$ by multiplying by the ratio $n/(n-2)$.
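The bias correction is a one-line adjustment; a sketch follows, using np.polyfit for the fitted line (an assumed but standard numpy routine):

```python
import numpy as np

def sigma2_estimates(x, y):
    """Biased MLE of sigma^2 and the bias-corrected estimator s^2."""
    b, a = np.polyfit(x, y, 1)                 # least squares slope, intercept
    rss_val = np.sum((y - (a + b * x))**2)
    n = len(y)
    sigma2_mle = rss_val / n                   # E[...] = (n-2)/n * sigma^2: biased
    s2 = sigma2_mle * n / (n - 2)              # multiply by n/(n-2): unbiased
    return sigma2_mle, s2
```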
Also, $\hat{\alpha}$ and $\hat{\beta}$ (and hence $s^2$) can be written in terms of $\sum_i y_i$, $\sum_i x_i y_i$, and $\sum_i y_i^2$. Thus all three estimators are functions of the set of complete sufficient statistics. If they are also unbiased, then they are Uniform Minimum Variance Unbiased (UMVU) estimators (using the Lehmann–Scheffé theorem).
We have already shown that $s^2$ is an unbiased estimator of $\sigma^2$. We need only show that $\hat{\alpha}$ and $\hat{\beta}$ are unbiased estimators of $\alpha$ and $\beta$.
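Unbiasedness can also be checked empirically by simulation; the sketch below assumes the same illustrative parameter values as before.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 1.0, 2.0, 0.5     # assumed true values
x = np.linspace(0.0, 10.0, 25)

# Repeatedly simulate data and re-estimate (alpha, beta).
estimates = []
for _ in range(10_000):
    y = rng.normal(alpha + beta * x, sigma)
    b, a = np.polyfit(x, y, 1)
    estimates.append((a, b))

print(np.mean(estimates, axis=0))      # averages should be close to (1.0, 2.0)
```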
Consider the random variable Y with
1. $E[Y] = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$ (alternatively $E[Y] = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$, intercept included), and
2. $\mathrm{var}(Y) = \sigma^2$,
where $\beta_1, \beta_2, \dots, \beta_p$ are unknown parameters and $X_1, X_2, \dots, X_p$ are nonrandom variables. Assume further that Y is normally distributed.
Thus the density of Y is:
$$f(y \mid \beta_1, \beta_2, \dots, \beta_p, \sigma^2) = f(y \mid \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}\left(y - \beta_1 X_1 - \dots - \beta_p X_p\right)^2\right)$$
Now suppose that $n$ independent observations of Y, $(y_1, y_2, \dots, y_n)$, are made corresponding to $n$ sets of values of $(X_1, X_2, \dots, X_p)$: $(x_{11}, x_{12}, \dots, x_{1p})$, $(x_{21}, x_{22}, \dots, x_{2p})$, ..., $(x_{n1}, x_{n2}, \dots, x_{np})$. Then the joint density of $\mathbf{y} = (y_1, y_2, \dots, y_n)$ is:
$$f(y_1, \dots, y_n \mid \beta_1, \dots, \beta_p, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{\!2}\right)$$
Consider the random variable Y with
1. $E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$ (intercept included), and
2. $\mathrm{var}(Y) = \sigma^2$,
where $\beta_0, \beta_1, \dots, \beta_p$ are unknown parameters and $X_1, X_2, \dots, X_p$ are nonrandom variables. Assume further that Y is normally distributed.
The matrix formulation (intercept included). Then the model becomes
$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \qquad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$$
Thus to include an intercept, add an extra column of 1’s in the design matrix $X$ and include the intercept $\beta_0$ in the parameter vector $\boldsymbol{\beta}$.
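In code, adding the intercept column is a single step; the sketch below uses randomly generated predictor values purely as placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 25, 3
X_raw = rng.normal(size=(n, p))              # n x p matrix of x_ij values
X = np.column_stack([np.ones(n), X_raw])     # prepend the column of 1's
# X is now n x (p+1), matching the parameter vector (beta_0, beta_1, ..., beta_p).
```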
The matrix formulation of the simple linear regression model:
$$\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \boldsymbol{\beta} = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$
Thus $\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{y}$. The Gauss–Markov theorem states that $\hat{\boldsymbol{\beta}}$ is the Best Linear Unbiased Estimator (B.L.U.E.) of $\boldsymbol{\beta}$.
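A numerical sketch of this estimator; it solves the normal equations $X'X\boldsymbol{\beta} = X'\mathbf{y}$ rather than forming the inverse explicitly, which is the standard numerically preferable route.

```python
import numpy as np

def ols(X, y):
    """Least squares estimator beta_hat = (X'X)^(-1) X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```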
Hypothesis testing for the GLM: the General Linear Hypothesis.
Testing the General Linear Hypothesis. The General Linear Hypothesis:
$$H_0:\quad \begin{aligned} h_{11}\beta_1 + h_{12}\beta_2 + h_{13}\beta_3 + \dots + h_{1p}\beta_p &= h_1 \\ h_{21}\beta_1 + h_{22}\beta_2 + h_{23}\beta_3 + \dots + h_{2p}\beta_p &= h_2 \\ &\;\;\vdots \\ h_{q1}\beta_1 + h_{q2}\beta_2 + h_{q3}\beta_3 + \dots + h_{qp}\beta_p &= h_q \end{aligned}$$
where $h_{11}, h_{12}, h_{13}, \dots, h_{qp}$ and $h_1, h_2, h_3, \dots, h_q$ are known coefficients. In matrix notation: $H_0: H\boldsymbol{\beta} = \mathbf{h}$, where $H$ is the $q \times p$ matrix of coefficients $h_{ij}$ and $\mathbf{h} = (h_1, \dots, h_q)'$.
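Testing $H_0: H\boldsymbol{\beta} = \mathbf{h}$ leads to the standard F statistic $F = (H\hat{\boldsymbol{\beta}} - \mathbf{h})'\,[H(X'X)^{-1}H']^{-1}\,(H\hat{\boldsymbol{\beta}} - \mathbf{h}) / (q\,s^2)$, which has an $F_{q,\,n-p}$ distribution under $H_0$ (here $p$ counts the columns of $X$). A sketch of this standard test, assuming scipy is available for the F distribution:

```python
import numpy as np
from scipy import stats

def general_linear_f_test(X, y, H, h):
    """F test of H0: H @ beta = h in the normal linear model (sketch)."""
    n, p = X.shape                                  # p = number of columns of X
    q = H.shape[0]                                  # number of constraints
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)                    # unbiased estimate of sigma^2
    d = H @ beta_hat - h
    F = d @ np.linalg.solve(H @ XtX_inv @ H.T, d) / (q * s2)
    p_value = stats.f.sf(F, q, n - p)               # F ~ F(q, n-p) under H0
    return F, p_value
```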