# Statistical Techniques I EXST7005 Simple Linear Regression


Measuring and describing a relationship between two variables

- Simple linear regression provides a measure of the rate of change of one variable relative to another variable.
- The variables are always paired: an independent variable (often referred to as the X variable) and a dependent variable (termed the Y variable).
- The value of the variable Y changes as the value of the variable X changes.

Simple Linear Regression (continued)

- For each value of X there is a population of values for the variable Y (normally distributed).

[Figure: Y plotted against X, with a normal distribution of Y values at each value of X]

Simple Linear Regression (continued)

- The linear model that describes this relationship is Yi = b0 + b1Xi, the equation for a straight line, where:
  - b0 is the value of the intercept (the value of Y when X = 0)
  - b1 is the amount of change in Y for each unit change in X (i.e., if X changes by 1 unit, Y changes by b1 units). b1 is also called the slope or REGRESSION COEFFICIENT.

Simple Linear Regression (continued)

- Population parameters:
  - μY.X = the true population mean of Y at each value of X
  - β0 = the true value of the Y intercept
  - β1 = the true value of the slope, the change in Y per unit of X
  - μY.X = β0 + β1Xi is the population equation for a straight line

Simple Linear Regression (continued)

- The equation for the line describes a perfect line with no variation. In practice there is always variation about the line, so we include an additional term to represent this variation:
  - Yi = β0 + β1Xi + εi for a population
  - Yi = b0 + b1Xi + ei for a sample
- When we put this term in the model, we are describing individual points as their position on the line, plus or minus some deviation.
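
To make the sample model concrete, here is a minimal Python sketch that generates observations as a point on the line plus a random deviation ei. The parameter values (intercept 2.0, slope 0.5, standard deviation 1.0) are made up purely for illustration:

```python
import random

# Made-up parameters for illustration: intercept, slope, and sd of deviations
b0_true, b1_true, sigma = 2.0, 0.5, 1.0

random.seed(42)  # reproducible deviations
x_values = [float(x) for x in range(20)]
# Each observation is its position on the line plus a random deviation e_i
y_values = [b0_true + b1_true * x + random.gauss(0.0, sigma) for x in x_values]
```

With the deviation term included, the simulated points scatter about the line rather than lying exactly on it.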

Simple Linear Regression (continued)

[Figure: scatter of observed points about the fitted line, Y plotted against X]

Simple Linear Regression (continued)

- The SS of deviations from the line will form the basis of a variance for the regression line.
- When we leave the ei off the sample model, we are describing a point on the regression line predicted from the sample. To indicate this we put a HAT on the Yi value: Ŷi = b0 + b1Xi.

Characteristics of a Regression Line

- The line will pass through the point (X̄, Ȳ), and also through the point (0, b0).
- The sum of squared deviations (measured vertically) of the points from the regression line will be a minimum.
- Values on the line can be described by the equation Ŷi = b0 + b1Xi.

Fitting the line

- Fitting the line starts with a corrected SSDeviation: the SSDeviation of the observations from a horizontal line through the mean.

[Figure: observations plotted against X, with vertical deviations from a horizontal line at Ȳ]

Fitting the line (continued)

- The fitted line is pivoted on the point (X̄, Ȳ) until it has a minimum SSDeviations.

[Figure: candidate lines rotating about the point (X̄, Ȳ); the fitted line minimizes the vertical deviations]

Fitting the line (continued)

- How do we know the SSDeviations are a minimum? We solve the equation for ei, and use calculus to determine the solution that has a minimum of Σei².
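
The calculus result can also be checked numerically: perturbing either parameter of a least-squares fit can only increase Σei². A Python sketch with made-up data (the b0 and b1 formulas are the ones derived later in this handout):

```python
# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

def ss_dev(b0, b1):
    """Sum of squared vertical deviations of the points from the line b0 + b1*X."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Least-squares estimates from the corrected-SS formulas
s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b1 = s_xy / s_xx
b0 = sum(y) / n - b1 * sum(x) / n

best = ss_dev(b0, b1)
# Nudging either parameter away from the least-squares solution never helps
assert all(ss_dev(b0 + d0, b1 + d1) >= best
           for d0 in (-0.5, 0.0, 0.5) for d1 in (-0.1, 0.0, 0.1))
```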

Fitting the line (continued)

- The line has some desirable properties:
  - E(b0) = β0
  - E(b1) = β1
  - E(Ŷi) = μY.X
- Therefore, the parameter estimates and predicted values are unbiased estimates.

The regression of Y on X

- Y = the "dependent" variable, the variable to be predicted.
- X = the "independent" variable, also called the regressor or predictor variable.
- General assumptions:
  - The Y variable is normally distributed at each value of X.
  - The variance is homogeneous (across X).
  - Observations are independent of each other, and ei is independent of the rest of the model.

The regression of Y on X (continued)

- Special assumption for regression: assume that all of the variation is attributable to the dependent variable (Y), and that the variable X is measured WITHOUT ERROR.
- Note that the deviations are measured vertically, not horizontally or perpendicular to the line.

Derivation of the formulas

- Any observation can be written as Yi = b0 + b1Xi + ei for a sample, where ei = a deviation of the observed point from the regression line.
- Note: the idea of regression is to minimize the deviations of the observations from the regression line; this is called a Least Squares Fit.

Derivation of the formulas (continued)

- Σei = 0
- The sum of the squared deviations:
  - Σei² = Σ(Yi - Ŷi)²
  - Σei² = Σ(Yi - b0 - b1Xi)²
- The objective is to select b0 and b1 such that Σei² is a minimum; this is done with calculus.
- You do not need to know this derivation!

A note on calculations

- We have previously defined the uncorrected sum of squares and corrected sum of squares of a variable Yi:
  - The uncorrected SS is ΣYi².
  - The correction factor is (ΣYi)²/n.
  - The corrected SS is ΣYi² - (ΣYi)²/n.
- Your book calls this SYY; the correction factor is CYY.
- We could define the exact same series of calculations for Xi, and call it SXX.

A note on calculations (continued)

- We will also need a crossproduct for regression, and a corrected crossproduct:
  - The crossproduct is XiYi.
  - The sum of crossproducts is ΣXiYi, which is uncorrected.
  - The correction factor is (ΣXi)(ΣYi)/n = CXY.
  - The corrected crossproduct is ΣXiYi - (ΣXi)(ΣYi)/n.
- Your book calls this SXY.
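
These quantities are all simple running sums. A Python sketch with made-up data:

```python
# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

sum_x, sum_y = sum(x), sum(y)

uncorrected_ss_x = sum(xi * xi for xi in x)            # sum of Xi^2
s_xx = uncorrected_ss_x - sum_x ** 2 / n               # corrected SS of X, SXX

uncorrected_cp = sum(xi * yi for xi, yi in zip(x, y))  # sum of crossproducts, sum of XiYi
cf_xy = sum_x * sum_y / n                              # correction factor CXY
s_xy = uncorrected_cp - cf_xy                          # corrected crossproduct, SXY
```

For this data, SXX = 10 and SXY = 19.9, which will feed directly into the slope formula.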

Derivation of the formulas (continued)

- The partial derivative of Σei² is taken with respect to each of the parameters; for b0:
  - ∂(Σei²)/∂b0 = 2Σ(Yi - b0 - b1Xi)(-1)

- Set the partial derivative to 0 and solve for b0:
  - 2Σ(Yi - b0 - b1Xi)(-1) = 0
  - -ΣYi + nb0 + b1ΣXi = 0
  - nb0 = ΣYi - b1ΣXi
  - b0 = Ȳ - b1X̄
- So b0 is estimated using b1 and the means of X and Y.
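
The b0 step can be sketched in Python. The data and the slope value 1.99 are made up for illustration (in practice b1 comes first, from the slope formula):

```python
# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = 1.99                  # slope, assumed already estimated for this step
b0 = y_bar - b1 * x_bar    # intercept from the slope and the two means
```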

Derivation of the formulas (continued)

- Likewise, for b1 we obtain the partial derivative:
  - ∂(Σei²)/∂b1 = 2Σ(Yi - b0 - b1Xi)(-Xi)

Derivation of the formulas (continued)

- Set the partial derivative to 0 and solve for b1:
  - 2Σ(Yi - b0 - b1Xi)(-Xi) = 0
  - -ΣYiXi + b0ΣXi + b1ΣXi² = 0
  - and since b0 = Ȳ - b1X̄, then
  - ΣYiXi = (ΣYi/n - b1ΣXi/n)ΣXi + b1ΣXi²
  - ΣYiXi = ΣXiΣYi/n - b1(ΣXi)²/n + b1ΣXi²
  - ΣYiXi - ΣXiΣYi/n = b1[ΣXi² - (ΣXi)²/n]
  - b1 = [ΣYiXi - ΣXiΣYi/n] / [ΣXi² - (ΣXi)²/n]

Derivation of the formulas (continued)

- b1 = [ΣYiXi - ΣXiΣYi/n] / [ΣXi² - (ΣXi)²/n]
- b1 = SXY / SXX
- So b1 is the corrected crossproduct over the corrected SS of X.
- The intermediate statistics needed to solve all elements of a SLR are ΣXi, ΣYi, n, ΣXi², ΣYiXi, and ΣYi² (this last term we haven't seen in the calculations above, but we will need it later).

Derivation of the formulas (continued)

- Review: we want to fit the best possible line, which we define as the line that minimizes the vertically measured distances from the observed values to the fitted line.
- The line that achieves this is defined by the equations:
  - b0 = Ȳ - b1X̄
  - b1 = [ΣYiXi - ΣXiΣYi/n] / [ΣXi² - (ΣXi)²/n]
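
Putting the two formulas together, a minimal Python sketch (data made up for illustration) that also checks the fitted line passes through the point (X̄, Ȳ):

```python
def fit_slr(x, y):
    """Least-squares fit of Y on X; returns (b0, b1) via the corrected-SS formulas."""
    n = len(x)
    s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b1 = s_xy / s_xx                   # slope = SXY / SXX
    b0 = sum(y) / n - b1 * sum(x) / n  # intercept = Ybar - b1*Xbar
    return b0, b1

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_slr(x, y)

# The fitted line passes through the point (Xbar, Ybar)
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
assert abs((b0 + b1 * x_bar) - y_bar) < 1e-9
```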

Derivation of the formulas (continued)

- These calculations provide us with two parameter estimates that we can then use to get the equation for the fitted line.

Numerical example

- See Regression handout.

About Crossproducts

- Crossproducts are used in a number of related calculations:
  - A crossproduct is YiXi.
  - The corrected sum of crossproducts is SXY.
  - Covariance = SXY / (n - 1)
  - Slope: b1 = SXY / SXX
  - SSRegression = SXY² / SXX
  - Correlation: r = SXY / √(SXX SYY)
  - R² = r² = SXY² / (SXX SYY) = SSRegression / SSTotal
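
All of these related quantities come from the same three corrected sums, as in this Python sketch with made-up data:

```python
import math

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n                  # SXX
s_yy = sum(yi * yi for yi in y) - sum(y) ** 2 / n                  # SYY (= SSTotal)
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # SXY

covariance = s_xy / (n - 1)
slope = s_xy / s_xx
ss_regression = s_xy ** 2 / s_xx
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = s_xy ** 2 / (s_xx * s_yy)  # equals ss_regression / s_yy
```

For this data the fit is very tight, so r² is close to 1.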