Presentation on theme: "Simple Linear Regression"— Presentation transcript:
1 Simple Linear Regression With Thanks to My Students in AMS572 – Data Analysis
2 1. Introduction
Example:
● Brad Pitt: […] m; Angelina Jolie: 1.70 m
● George Bush: 1.81 m; Laura Bush: ?
● David Beckham: […] m; Victoria Beckham: 1.68 m
Goal: to predict the height of the wife in a couple, based on the husband's height.
● Response (outcome or dependent) variable (Y): height of the wife
● Predictor (explanatory or independent) variable (X): height of the husband
3 History
Regression analysis:
● Regression analysis is a statistical methodology for estimating the relationship of a response variable to a set of predictor variables.
● When there is just one predictor variable, we use simple linear regression. When there are two or more predictor variables, we use multiple linear regression.
● When it is not clear which variable represents a response and which is a predictor, correlation analysis is used to study the strength of the relationship.
History:
● The earliest form of linear regression was the method of least squares, published by Legendre in 1805 and by Gauss in 1809.
● The method was extended by Francis Galton in the 19th century to describe a biological phenomenon.
● This work was extended by Karl Pearson and Udny Yule to a more general statistical context around the turn of the 20th century.
4 A probabilistic model
We denote the n observed values of the predictor variable x as $x_1, x_2, \ldots, x_n$.
We denote the corresponding observed values of the response variable Y as $y_1, y_2, \ldots, y_n$.
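5 Notations of the simple linear regression
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, n$$
● $y_i$: observed value of the random variable $Y_i$, which depends on $x_i$
● $\epsilon_i$: random error with $E(\epsilon_i) = 0$
● $\mu_i = E(Y_i) = \beta_0 + \beta_1 x_i$: unknown mean of $Y_i$ (the true regression line)
● $\beta_1$: unknown slope
● $\beta_0$: unknown intercept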
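7 Four Basic Assumptions for Statistical Inference
1. The mean of $Y_i$ is a linear function of the predictor variable $x_i$.
2. The $Y_i$ have a common variance $\sigma^2$, the same for all values of x.
3. The errors $\epsilon_i$ are normally distributed.
4. The errors $\epsilon_i$ are independent.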
8 Conditional expectation of Y given X = x
Comments:
1. "Linear" means linear not in x but in the parameters $\beta_0$ and $\beta_1$. Example: $E(Y) = \beta_0 + \beta_1 \log x$ is linear, with $x^* = \log x$.
2. When the predictor variable is not set at predetermined fixed values but is random along with Y, the model can be considered a conditional model for the conditional expectation of Y given X = x:
$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$
Example: height and weight of children. Height (X): given. Weight (Y): to be predicted.
9 2. Fitting the Simple Linear Regression Model 2.1 Least Squares (LS) Fit
10 Example 10.1 (Tire Tread Wear vs. Mileage: Scatter Plot). From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.
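12 One way to find the LS estimates $\hat\beta_0$ and $\hat\beta_1$
The "best" fitting straight line, in the sense of minimizing Q:
$$Q = \sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_i)\right]^2$$
The values $\hat\beta_0$ and $\hat\beta_1$ that minimize Q are the LS estimates. One way to find them is to take the partial derivatives
$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_i)\right], \qquad \frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\left[y_i - (\beta_0 + \beta_1 x_i)\right]$$
Setting these partial derivatives equal to zero and simplifying, we get
$$\beta_0 n + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$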
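14 To simplify, we introduce
$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y) = \sum x_i y_i - \frac{1}{n}\Big(\sum x_i\Big)\Big(\sum y_i\Big)$$
$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar x)^2 = \sum x_i^2 - \frac{1}{n}\Big(\sum x_i\Big)^2, \qquad S_{yy} = \sum_{i=1}^{n}(y_i - \bar y)^2 = \sum y_i^2 - \frac{1}{n}\Big(\sum y_i\Big)^2$$
The LS estimates are then
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
The resulting equation $\hat y = \hat\beta_0 + \hat\beta_1 x$ is known as the least squares line, which is an estimate of the true regression line.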
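15 Example 10.2 (Tire Tread Wear vs. Mileage: LS Line Fit)
Find the equation of the LS line for the tire tread wear data from Table 10.1. We have
$$\sum x_i = 144, \quad \sum y_i = 2197.32, \quad \sum x_i^2 = 3264, \quad \sum y_i^2 = 589{,}887.08, \quad \sum x_i y_i = 28{,}167.72$$
and n = 9. From these we calculate
$$\bar x = 16, \quad \bar y = 244.15, \quad S_{xy} = 28{,}167.72 - \frac{144 \times 2197.32}{9} = -6989.40, \quad S_{xx} = 3264 - \frac{144^2}{9} = 960$$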
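16 The slope and intercept estimates are
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{-6989.40}{960} = -7.281, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x = 360.64$$
Therefore, the equation of the LS line is $\hat y = 360.64 - 7.281x$.
Conclusion: there is a loss of 7.281 mils in the tire groove depth for every 1000 miles of driving.
Given a particular $x^* = 25$, we find $\hat y^* = 360.64 - 7.281 \times 25 = 178.62$ mils, which means that the mean groove depth for all tires driven for 25,000 miles is estimated to be 178.62 mils.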
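The LS fit of the tire example can be checked numerically. The following is a minimal sketch, assuming the nine (mileage, groove depth) pairs of the tire tread wear example (mileage in units of 1000 miles, depth in mils); the function name `least_squares` is illustrative and not part of the slides.

```python
# Tire tread wear data: mileage in 1000 miles (x), groove depth in mils (y).
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]


def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) of the LS line y = beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sums of cross-products and squares.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


beta0_hat, beta1_hat = least_squares(x, y)
# Estimated mean groove depth at 25,000 miles (x = 25).
y_hat_25 = beta0_hat + beta1_hat * 25
print(round(beta0_hat, 2), round(beta1_hat, 3), round(y_hat_25, 2))
```

The centered form of $S_{xy}$ and $S_{xx}$ is used here instead of the raw-sum shortcut because it is less sensitive to rounding; both give the same estimates.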
17 2.2 Goodness of Fit of the LS Line
Coefficient of Determination and Correlation
The residuals
$$e_i = y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i), \quad i = 1, 2, \ldots, n$$
are used to evaluate the goodness of fit of the LS line.
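18 We define:
● Total sum of squares: $SST = \sum_{i=1}^{n}(y_i - \bar y)^2$
● Regression sum of squares: $SSR = \sum_{i=1}^{n}(\hat y_i - \bar y)^2$
● Error sum of squares: $SSE = \sum_{i=1}^{n}(y_i - \hat y_i)^2$
Note: $SST = SSR + SSE$.
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
is called the coefficient of determination.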
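19 Example 10.3 (Tire Tread Wear vs. Mileage: Coefficient of Determination and Correlation)
For the tire tread wear data, calculate $r^2$ using the results from Example 10.2. We have
$$SST = S_{yy} = \sum y_i^2 - \frac{1}{n}\Big(\sum y_i\Big)^2 = 589{,}887.08 - \frac{(2197.32)^2}{9} = 53{,}418.73$$
Next calculate
$$SSR = \hat\beta_1^2 S_{xx} = \frac{S_{xy}^2}{S_{xx}} = \frac{(-6989.40)^2}{960} = 50{,}887.20, \qquad SSE = SST - SSR = 2531.53$$
Therefore
$$r^2 = \frac{50{,}887.20}{53{,}418.73} = 0.953$$
The Pearson correlation is
$$r = -\sqrt{0.953} = -0.976$$
where the sign of r follows from the sign of $\hat\beta_1$. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.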
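The decomposition SST = SSR + SSE and the coefficient of determination can be verified directly from the data. This is a small sketch under the same assumption as before (the nine tire data pairs, mileage in 1000 miles and depth in mils); variable names are illustrative.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# LS estimates and fitted values.
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar
fitted = [beta0 + beta1 * xi for xi in x]

# Sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r2 = ssr / sst
# The sample correlation takes the sign of the slope estimate.
r = math.copysign(math.sqrt(r2), beta1)
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(r2, 3), round(r, 3))
```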
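20 The Maximum Likelihood Estimators (MLE)
Consider the linear model
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where $\epsilon_i$ is drawn from a normal population with mean 0 and standard deviation σ. The likelihood function for Y is
$$L(\beta_0, \beta_1, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right]$$
Thus, the log-likelihood for the data is
$$\ln L = -\frac{n}{2}\ln(2\pi) - n\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$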
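21 The MLE Estimators
Solving
$$\frac{\partial \ln L}{\partial \beta_0} = 0, \qquad \frac{\partial \ln L}{\partial \beta_1} = 0, \qquad \frac{\partial \ln L}{\partial \sigma} = 0$$
we obtain the MLEs of the three unknown model parameters.
● The MLEs of the model parameters $\beta_0$ and $\beta_1$ are the same as the LSEs; both are unbiased.
● The MLE of the error variance, however, is biased:
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 = \frac{SSE}{n}, \qquad E(\hat\sigma^2) = \frac{n-2}{n}\sigma^2$$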
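22 2.3 An Unbiased Estimator of $\sigma^2$
An unbiased estimate of $\sigma^2$ is given by
$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2}$$
Example 10.4 (Tire Tread Wear vs. Mileage: Estimate of $\sigma^2$)
Find the estimate of $\sigma^2$ for the tread wear data using the results from Example 10.3. We have SSE = 2531.53 and n − 2 = 7, therefore
$$s^2 = \frac{2531.53}{7} = 361.65$$
which has 7 d.f. The estimate of σ is $s = \sqrt{361.65} = 19.02$ mils.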
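The difference between the biased MLE (divisor n) and the unbiased estimator (divisor n − 2) is easy to see numerically. The sketch below assumes the nine tire data pairs; `sse_of_ls_fit` is a hypothetical helper name introduced only for this illustration.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]


def sse_of_ls_fit(x, y):
    """Error sum of squares of the least squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


n = len(x)
sse = sse_of_ls_fit(x, y)
sigma2_mle = sse / n      # MLE of the error variance (biased downward)
s2 = sse / (n - 2)        # unbiased estimator with n - 2 d.f.
s = math.sqrt(s2)         # estimate of sigma, in mils
print(round(sigma2_mle, 2), round(s2, 2), round(s, 2))
```

With only nine observations the two divisors differ noticeably, which is why the unbiased version is the one used in the inference formulas that follow.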
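23 3. Statistical Inference on $\beta_0$ and $\beta_1$
Under the normal error assumption:
* Point estimators:
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
* Sampling distributions of $\hat\beta_0$ and $\hat\beta_1$:
$$\hat\beta_0 \sim N\left(\beta_0, \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right), \qquad \hat\beta_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right)$$
with estimated standard errors
$$SE(\hat\beta_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}}, \qquad SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}$$
For mathematical derivations, please refer to the Tamhane and Dunlop textbook, p. 331.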
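24 Statistical Inference on $\beta_0$ and $\beta_1$, Cont'd
* Pivotal Quantities (P.Q.'s):
$$\frac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2}, \qquad \frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$$
* Confidence Intervals (CI's), at level $100(1-\alpha)\%$:
$$\hat\beta_0 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_0), \qquad \hat\beta_1 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_1)$$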
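25 Statistical Inference on $\beta_0$ and $\beta_1$, Cont'd
* Hypothesis tests:
$$H_0: \beta_1 = \beta_1^0 \text{ vs. } H_a: \beta_1 \neq \beta_1^0, \qquad H_0: \beta_0 = \beta_0^0 \text{ vs. } H_a: \beta_0 \neq \beta_0^0$$
-- Test statistics:
$$t_0 = \frac{\hat\beta_1 - \beta_1^0}{SE(\hat\beta_1)}, \qquad t_0 = \frac{\hat\beta_0 - \beta_0^0}{SE(\hat\beta_0)}$$
-- At the significance level α, we reject $H_0$ in favor of $H_a$ if and only if (iff) $|t_0| > t_{n-2,\alpha/2}$.
-- The first test, with $\beta_1^0 = 0$, is used to show whether there is a linear relationship between x and y.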
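As an illustration of the slope test and confidence interval, the sketch below applies them to the tire data with α = 0.05. It assumes the nine tire data pairs, and the critical value $t_{7,0.025} = 2.365$ is hardcoded from a standard t table rather than computed, since the Python standard library has no t quantile function.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

# Assumption: t_{7, 0.025} from a standard t table (n - 2 = 7 d.f., alpha = 0.05).
T_CRIT = 2.365

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

# s estimates sigma with n - 2 degrees of freedom.
sse = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

se_beta1 = s / math.sqrt(s_xx)          # standard error of the slope
t0 = (beta1 - 0.0) / se_beta1           # test statistic for H0: beta1 = 0
reject_h0 = abs(t0) > T_CRIT            # two-sided test at alpha = 0.05
ci_beta1 = (beta1 - T_CRIT * se_beta1, beta1 + T_CRIT * se_beta1)
print(round(t0, 2), reject_h0, tuple(round(v, 3) for v in ci_beta1))
```

Because the interval for the slope excludes zero, the test and the confidence interval lead to the same conclusion, as they must for a two-sided test at the same level.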
26 Analysis of Variance (ANOVA), Cont'd
Mean Square: a sum of squares divided by its d.f.
$$MSR = \frac{SSR}{1}, \qquad MSE = \frac{SSE}{n-2}, \qquad F = \frac{MSR}{MSE}$$
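27 Analysis of Variance (ANOVA)
ANOVA Table
Source of Variation (Source) | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS) | F
Regression | SSR | 1 | MSR = SSR/1 | F = MSR/MSE
Error | SSE | n - 2 | MSE = SSE/(n - 2) |
Total | SST | n - 1 | |
Example (tire tread wear data):
Source | SS | d.f. | MS | F
Regression | 50,887.20 | 1 | 50,887.20 | 140.71
Error | 2,531.53 | 7 | 361.65 |
Total | 53,418.73 | 8 | |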
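The entries of the ANOVA table follow mechanically from the sums of squares. Here is a minimal sketch, again assuming the nine tire data pairs; the row layout printed at the end is only one possible way to display the table.

```python
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar
fitted = [beta0 + beta1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sst - ssr

# Mean square = sum of squares divided by its degrees of freedom.
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

rows = [
    ("Regression", ssr, 1, msr, f_stat),
    ("Error", sse, n - 2, mse, None),
    ("Total", sst, n - 1, None, None),
]
for source, ss, df, ms, f in rows:
    ms_txt = f"{ms:.2f}" if ms is not None else ""
    f_txt = f"{f:.2f}" if f is not None else ""
    print(f"{source:<10} {ss:>10.2f} {df:>3} {ms_txt:>10} {f_txt:>8}")
```

In simple linear regression the F statistic equals the square of the t statistic for testing $H_0: \beta_1 = 0$, so the two tests are equivalent.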
28 4. Regression Diagnostics
4.1 Checking for Model Assumptions
● Checking for Linearity
● Checking for Constant Variance
● Checking for Normality
● Checking for Independence
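29 Checking for Linearity
● $x_i$ = mileage (in 1000 miles); $y_i$ = groove depth (in mils)
● $\hat y = \hat\beta_0 + \hat\beta_1 x$; $\hat y_i$ = fitted value
● $e_i$ = residual: $e_i = y_i - \hat y_i$
i | $x_i$ | $y_i$ | $\hat y_i$ | $e_i$
1 | 0 | 394.33 | 360.64 | 33.69
2 | 4 | 329.50 | 331.51 | -2.01
3 | 8 | 291.00 | 302.39 | -11.39
4 | 12 | 255.17 | 273.27 | -18.10
5 | 16 | 229.33 | 244.15 | -14.82
6 | 20 | 204.83 | 215.02 | -10.19
7 | 24 | 179.00 | 185.90 | -6.90
8 | 28 | 163.83 | 156.78 | 7.05
9 | 32 | 150.33 | 127.66 | 22.67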
31 Checking for Constant Variance
[Figures: sample residual plots when Var(Y) is not constant and when Var(Y) is constant.]
32 Checking for Independence
● Does not apply to the simple linear regression model in general.
● Applies only to time series data.
33 4.2 Checking for Outliers & Influential Observations
● What is an OUTLIER
● Why checking for outliers is important
● Mathematical definition
● How to deal with them
34 4.2-A. Intro
Recall the Box and Whiskers Plot (Chapter 4 of T&D):
● A (mild) OUTLIER is any observation that lies outside the interval [Q1 − 1.5×IQR, Q3 + 1.5×IQR], where the interquartile range is IQR = Q3 − Q1.
● An (extreme) OUTLIER is any observation that lies outside the interval [Q1 − 3×IQR, Q3 + 3×IQR].
● Informally, an outlier is an observation "far away" from the rest of the data.
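The box-plot rule can be written as a short function. This is a sketch only: the sample data are made up for illustration, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is one of several quartile conventions and may differ slightly from the one used in T&D.

```python
import statistics


def classify_outliers(data):
    """Split observations into mild and extreme outliers using IQR fences."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mild_lo, mild_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    extreme_lo, extreme_hi = q1 - 3 * iqr, q3 + 3 * iqr
    # Extreme outliers lie beyond the outer fences.
    extreme = [v for v in data if v < extreme_lo or v > extreme_hi]
    # Mild outliers lie beyond the inner fences but within the outer ones.
    mild = [
        v for v in data
        if (v < mild_lo or v > mild_hi) and extreme_lo <= v <= extreme_hi
    ]
    return {"mild": mild, "extreme": extreme}


# Hypothetical sample: Q1 = 5, Q3 = 13, IQR = 8, so the fences are
# [-7, 25] (mild) and [-19, 37] (extreme).
sample = [1, 3, 5, 7, 9, 11, 13, 30, 60]
result = classify_outliers(sample)
print(result)
```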
35 4.2-B. Why are outliers a problem?
● They may indicate a sample peculiarity, a data entry error, or another problem.
● Regression coefficients estimated by minimizing the Sum of Squares for Error (SSE) are very sensitive to outliers, which leads to bias or distortion of the estimates.
● Any statistical test based on sample means and variances can be distorted in the presence of outliers, which leads to distortion of p-values and faulty conclusions.
(Estimators not sensitive to outliers are said to be robust.)
Example:
Sorted Data | Median | Mean | Variance | 95% CI for mean
Real Data | 5 | 6.0 | 20.6 | [0.45, 11.55]
Data with Error | […] | 27.6 | 2676.8 | […, 91.83]
36 4.2-C. Mathematical Definition: Outlier
The standardized residual is given by
$$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar x)^2}{S_{xx}}}}, \quad i = 1, 2, \ldots, n$$
If $|e_i^*| > 2$, then the corresponding observation may be regarded as an outlier.
Example: (Tire Tread Wear vs. Mileage)
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
$e_i^*$ | 2.25 | -0.12 | -0.66 | -1.02 | -0.83 | -0.57 | -0.40 | 0.43 | 1.51
STUDENTIZED RESIDUAL: a type of standardized residual calculated with the current observation deleted from the analysis.
The LS fit can be excessively influenced by an observation that is not necessarily an outlier as defined above.
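The standardized residuals of the tire example can be reproduced as follows. The sketch assumes the nine tire data pairs and uses the usual simple-linear-regression form $SE(e_i) = s\sqrt{1 - 1/n - (x_i - \bar x)^2/S_{xx}}$; variable names are illustrative.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

# Raw residuals and the estimate s of sigma (n - 2 d.f.).
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standardized residual: e_i divided by its estimated standard error.
std_residuals = [
    e / (s * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx))
    for xi, e in zip(x, residuals)
]
# 1-based indices of observations flagged by the |e_i*| > 2 rule.
flagged = [i + 1 for i, e in enumerate(std_residuals) if abs(e) > 2]
print([round(e, 2) for e in std_residuals], flagged)
```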
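37 4.2-C. Mathematical Definition: Influential Observation
● An influential observation is an observation with an extreme x-value, y-value, or both.
● The leverage $h_{ii}$ measures how strongly $y_i$ affects its own fitted value; for simple linear regression, $h_{ii} = \dfrac{1}{n} + \dfrac{(x_i - \bar x)^2}{S_{xx}}$.
● On average $h_{ii}$ is (k+1)/n, where k is the number of predictors; regard any $h_{ii} > 2(k+1)/n$ as high leverage.
● If $x_i$ deviates greatly from the mean $\bar x$, then $h_{ii}$ is large.
● The standardized residual will be large for a high leverage observation.
● Influence can be thought of as the product of leverage and outlierness.
Example: an observation that is influential (high leverage) but not an outlier.
[Figures: e.g. 1, LS fit with and without the observation; e.g. 2, scatter plot and residual plot.]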
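To make the leverage rule concrete, the sketch below computes $h_{ii}$ for the tire mileages and applies the $2(k+1)/n$ cutoff with k = 1. It assumes the standard simple-linear-regression formula $h_{ii} = 1/n + (x_i - \bar x)^2 / S_{xx}$; leverage depends only on the x-values, so the groove depths are not needed.

```python
# Mileage values of the tire example, in 1000 miles.
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]

n = len(x)
k = 1  # one predictor in simple linear regression
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage of each observation.
leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

cutoff = 2 * (k + 1) / n  # 4/9 for the tire data
high_leverage = [i + 1 for i, h in enumerate(leverage) if h > cutoff]
print([round(h, 4) for h in leverage], round(cutoff, 4), high_leverage)
```

The leverages always sum to k + 1, which is where the average value (k+1)/n comes from. For the evenly spaced tire mileages no observation exceeds the cutoff, even though the end points have the largest leverage.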
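38 4.2-C. SAS code of the tire example
data tire;
  input x y;
  datalines;
…
;
run;
/* Fit the LS line and save the studentized residual, leverage, Cook's distance and DFFITS */
proc reg data=tire;
  model y=x;
  output out=resid rstudent=r h=lev cookd=cd dffits=dffit;
run;
/* Print the observations that exceed any of the cutoffs */
proc print data=resid;
  where abs(r)>=2 or lev>(4/9) or cd>(4/9) or abs(dffit)>(2*sqrt(1/9));
run;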
40 4.2-D. How to deal with Outliers & Influential Observations
● Investigate (Data errors? Rare events? Can they be corrected?)
● Ways to accommodate outliers:
-- Nonparametric methods (robust to outliers)
-- Data transformations
-- Deletion (or report model results both with and without the outliers or influential observations to see how much they change)
41 4.3 Data Transformations
Reasons:
● To achieve linearity
● To achieve homogeneity of variance
● To achieve normality or symmetry about the regression equation
42 Types of Transformation
● Linearizing Transformation: a transformation of the response variable, the predictor variable, or both, which produces an approximately linear relationship between the variables.
● Variance Stabilizing Transformation: a transformation made when the constant variance assumption is violated.
43 Linearizing Transformation
● Use a mathematical operation, e.g. square root, power, log, exponential, etc.
● Only one variable needs to be transformed in simple linear regression.
● Which one? Predictor or response? Why?
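44 e.g. We take a log transformation on $Y = a\exp(-bx) \iff \log Y = \log a - bx$
$x_i$ | $y_i$ | fitted $\log y_i$ | $\hat y_i$ = exp(fitted $\log y_i$) | $e_i$
0 | 394.33 | 5.926 | 374.64 | 19.69
4 | 329.50 | 5.807 | 332.58 | -3.08
8 | 291.00 | 5.688 | 295.24 | -4.24
12 | 255.17 | 5.569 | 262.09 | -6.92
16 | 229.33 | 5.450 | 232.67 | -3.34
20 | 204.83 | 5.331 | 206.54 | -1.71
24 | 179.00 | 5.211 | 183.36 | -4.36
28 | 163.83 | 5.092 | 162.77 | 1.06
32 | 150.33 | 4.973 | 144.50 | 5.83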
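The log-transformed fit can be reproduced by regressing the natural log of groove depth on mileage and transforming the fitted values back. A minimal sketch, assuming the nine tire data pairs and the natural logarithm; the helper name `fit_line` is illustrative.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]


def fit_line(x, y):
    """Least squares intercept and slope for y regressed on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Fit log Y = log a - b x on the log scale.
log_y = [math.log(yi) for yi in y]
log_a_hat, slope_hat = fit_line(x, log_y)

# Back-transform the fitted values and compute residuals on the original scale.
fitted_log = [log_a_hat + slope_hat * xi for xi in x]
fitted_y = [math.exp(v) for v in fitted_log]
residuals = [yi - fi for yi, fi in zip(y, fitted_y)]
print(round(log_a_hat, 3), round(slope_hat, 4), [round(e, 2) for e in residuals])
```

The residuals on the original scale are much smaller in magnitude than those of the straight-line fit, which suggests the exponential model describes the curvature in the data better.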
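45 Variance Stabilizing Transformation
Delta method: two-term Taylor-series approximation
$$\mathrm{Var}(h(Y)) \approx [h'(\mu)]^2 g^2(\mu), \quad \text{where } \mathrm{Var}(Y) = g^2(\mu),\ E(Y) = \mu$$
Set $[h'(\mu)]^2 g^2(\mu) = 1$, so that
$$h'(\mu) = \frac{1}{g(\mu)}, \qquad h(\mu) = \int \frac{d\mu}{g(\mu)}, \qquad h(y) = \int \frac{dy}{g(y)}$$
e.g. $\mathrm{Var}(Y) = c^2\mu^2$, where c > 0: $g(\mu) = c\mu \leftrightarrow g(y) = cy$, and
$$h(y) = \int \frac{dy}{cy} = \frac{1}{c}\int \frac{dy}{y} = \frac{1}{c}\log y$$
Therefore the variance stabilizing transformation is the logarithmic transformation.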
46 5. Correlation Analysis
● Pearson Product Moment Correlation: a measurement of how closely two variables share a linear relationship.
● Useful when it is not possible to determine which variable is the predictor and which is the response.
● Example: health vs. wealth. Which is the predictor? Which is the response?
47 Statistical Inference on the Correlation Coefficient ρ
We can derive a test on the correlation coefficient in the same way that we have been doing in class.
1. Assumptions: X, Y are from the bivariate normal distribution.
2. Start with a point estimator: r, the sample correlation coefficient, is an estimator of the population correlation coefficient ρ:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
3. Get the pivotal quantity: the distribution of R is quite complicated, so we use $T_0$, the test statistic for ρ = 0:
$$T_0 = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}$$
4. Do we know everything about the p.q.? Yes: $T_0 \sim t_{n-2}$ under $H_0: \rho = 0$.
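48 Bivariate Normal Distribution
pdf:
$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2\right]\right\}$$
Properties:
● $\mu_1, \mu_2$: means of X, Y
● $\sigma_1^2, \sigma_2^2$: variances of X, Y
● ρ: the correlation coefficient between X and Y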
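49 Derivation of $T_0$
Using $\hat\beta_1 = \dfrac{S_{xy}}{S_{xx}} = r\sqrt{\dfrac{S_{yy}}{S_{xx}}}$ and $s^2 = \dfrac{SSE}{n-2} = \dfrac{(1-r^2)S_{yy}}{n-2}$, the t statistic for the slope becomes
$$t = \frac{\hat\beta_1}{SE(\hat\beta_1)} = \frac{\hat\beta_1\sqrt{S_{xx}}}{s} = \frac{r\sqrt{S_{yy}}}{\sqrt{(1-r^2)S_{yy}/(n-2)}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
Therefore, we can use t as a statistic for testing the null hypothesis $H_0: \beta_1 = 0$. Equivalently, we can use it to test $H_0: \rho = 0$.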
50 Exact Statistical Inference on ρ
Test: $H_0: \rho = 0$, $H_a: \rho \neq 0$
Test statistic:
$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
Reject $H_0$ iff $|t_0| > t_{n-2,\alpha/2}$.
Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be […]. Is this correlation statistically significant at the .01 level?
$H_0: \rho = 0$, $H_a: \rho \neq 0$. For α = .01, $t_0$ = […] > $t_{13,.005}$ = 3.012, so we reject $H_0$.
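The exact test can be sketched in a few lines. Because the value of r is missing from the transcript, the correlation used below (r = 0.7) is a purely hypothetical value chosen for illustration, not the figure from the slide; only n = 15 and the critical value $t_{13,.005} = 3.012$ come from the example.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0 from a sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


n = 15
r_hypothetical = 0.7   # assumption: illustrative value only
t_crit = 3.012         # t_{13, .005}, as quoted in the example

t0 = correlation_t_statistic(r_hypothetical, n)
reject_h0 = abs(t0) > t_crit
print(round(t0, 3), reject_h0)
```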
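51 Approximate Statistical Inference on ρ
● There is no exact method of testing ρ against an arbitrary $\rho_0$: the distribution of R is very complicated, and $T_0 \sim t_{n-2}$ only when ρ = 0.
● To test ρ against an arbitrary $\rho_0$, one can use Fisher's transformation:
$$\tanh^{-1} R = \frac{1}{2}\ln\left(\frac{1+R}{1-R}\right) \approx N\left(\frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right), \frac{1}{n-3}\right)$$
● Therefore, let $\psi = \dfrac{1}{2}\ln\left(\dfrac{1+\rho}{1-\rho}\right)$, so that $H_0: \rho = \rho_0$ is equivalent to $H_0: \psi = \psi_0 = \dfrac{1}{2}\ln\left(\dfrac{1+\rho_0}{1-\rho_0}\right)$.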
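52 Approximate Statistical Inference on ρ
Sample estimate:
$$\hat\psi = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$$
Z test statistic:
$$z_0 = \sqrt{n-3}\,(\hat\psi - \psi_0)$$
We reject $H_0$ if $|z_0| > z_{\alpha/2}$.
CI for ρ: first compute the interval for ψ,
$$l = \hat\psi - \frac{z_{\alpha/2}}{\sqrt{n-3}}, \qquad u = \hat\psi + \frac{z_{\alpha/2}}{\sqrt{n-3}}$$
then transform back:
$$\frac{e^{2l}-1}{e^{2l}+1} \leq \rho \leq \frac{e^{2u}-1}{e^{2u}+1}$$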
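Fisher's transformation is available in the Python standard library as `math.atanh`, with `math.tanh` as its inverse. The sketch below computes the z statistic and an approximate confidence interval; the inputs r = 0.7, n = 15 and $\rho_0$ = 0.5 are hypothetical values chosen only to exercise the formulas.

```python
import math
from statistics import NormalDist


def fisher_z_test(r, n, rho0):
    """z statistic for H0: rho = rho0 based on Fisher's transformation."""
    psi_hat = math.atanh(r)       # 0.5 * ln((1 + r) / (1 - r))
    psi0 = math.atanh(rho0)
    return math.sqrt(n - 3) * (psi_hat - psi0)


def fisher_ci(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval for rho."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit / math.sqrt(n - 3)
    psi_hat = math.atanh(r)
    # tanh(v) = (e^{2v} - 1) / (e^{2v} + 1) maps the interval back to the rho scale.
    return math.tanh(psi_hat - half_width), math.tanh(psi_hat + half_width)


# Hypothetical inputs for illustration.
z0 = fisher_z_test(r=0.7, n=15, rho0=0.5)
ci_low, ci_high = fisher_ci(r=0.7, n=15, alpha=0.05)
reject_h0 = abs(z0) > NormalDist().inv_cdf(0.975)
print(round(z0, 3), round(ci_low, 3), round(ci_high, 3), reject_h0)
```

The interval is not symmetric around r, because the back-transformation compresses values near ±1.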
53 Approximate Statistical Inference on ρ using SAS
[Figures: SAS code and output.]
54 Pitfalls of Regression and Correlation Analysis
● Correlation and causation (example: ticks cause good health)
● Coincidental data (example: sun spots and Republicans)
● Lurking variables (example: church, suicide, population)
● Restricted range (local vs. global linearity)
55 Summary
Linear regression analysis:
● Probabilistic model for linear regression
● The least squares (LS) estimates: $\hat\beta_0$ and $\hat\beta_1$
● Model assumptions
● Outliers? Influential observations? Data transformations?
● Confidence interval & prediction interval
Correlation analysis:
● Correlation coefficient r
56 Summary, Cont'd
● Least Squares (LS) Fit
● Statistical inference on $\beta_0$ & $\beta_1$; Prediction Interval
● Model Assumptions: Linearity, Constant Variance, Normality, Independence
● Correlation Analysis: Sample correlation coefficient r