Simple Linear Regression

Simple Linear Regression With Thanks to My Students in AMS572 – Data Analysis

1. Introduction
Example: to predict the height of the wife in a couple, based on the husband’s height.
Brad Pitt: 1.83 m; Angelina Jolie: 1.70 m
George Bush: 1.81 m; Laura Bush: ?
David Beckham: 1.83 m; Victoria Beckham: 1.68 m
● Response (outcome or dependent) variable (Y): height of the wife
● Predictor (explanatory or independent) variable (X): height of the husband

Regression analysis:
● Regression analysis is a statistical methodology for estimating the relationship of a response variable to a set of predictor variables.
● When there is just one predictor variable, we use simple linear regression. When there are two or more predictor variables, we use multiple linear regression.
● When it is not clear which variable is the response and which is the predictor, correlation analysis is used to study the strength of the relationship.
History:
● The earliest form of linear regression was the method of least squares, published by Legendre in 1805 and by Gauss in 1809.
● The method was extended by Francis Galton in the 19th century to describe a biological phenomenon.
● This work was extended by Karl Pearson and Udny Yule to a more general statistical context around the turn of the 20th century.

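A probabilistic model
We denote the n observed values of the predictor variable x as $x_1, x_2, \ldots, x_n$.
We denote the corresponding observed values of the response variable Y as $y_1, y_2, \ldots, y_n$.

Notations of the simple linear regression
$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, \ldots, n$
- $y_i$: observed value of the random variable $Y_i$, which depends on $x_i$
- $\epsilon_i$: random error with $E(\epsilon_i) = 0$
- $E(Y_i) = \mu_i = \beta_0 + \beta_1 x_i$: unknown mean of $Y_i$ (the true regression line)
- $\beta_1$: unknown slope
- $\beta_0$: unknown intercept

4 BASIC ASSUMPTIONS – for statistical inference
1. The mean of $Y_i$ is a linear function of the predictor variable: $E(Y_i) = \beta_0 + \beta_1 x_i$.
2. The $Y_i$ have a common variance $\sigma^2$, the same for all values of x.
3. The errors $\epsilon_i$ (and hence the $Y_i$) are normally distributed.
4. The errors $\epsilon_i$ (and hence the $Y_i$) are independent.

Comments:
1. "Linear" means linear in the parameters $\beta_0$ and $\beta_1$, not in x. Example: $E(Y) = \beta_0 + \beta_1 \log x$ is linear in the parameters; set $x^* = \log x$.
2. The predictor variable need not be set at predetermined fixed values; it may be random along with Y. The model can then be considered a conditional model, written in terms of the conditional expectation of Y given X = x: $E(Y \mid X = x) = \beta_0 + \beta_1 x$.
Example: height and weight of children. Height (X): given. Weight (Y): to be predicted.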

2. Fitting the Simple Linear Regression Model
2.1 Least Squares (LS) Fit

Example 10.1 (Tire Tread Wear vs. Mileage: Scatter Plot). From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.

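The "best" fitting straight line, in the sense of minimizing Q:
$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$
LS estimates: the values $\hat\beta_0$ and $\hat\beta_1$ that minimize Q.
One way to find the LS estimates $\hat\beta_0$ and $\hat\beta_1$ is to take the partial derivatives of Q:
$\partial Q / \partial \beta_0 = -2 \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]$
$\partial Q / \partial \beta_1 = -2 \sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)]$
Setting these partial derivatives equal to zero and simplifying, we get the normal equations
$\beta_0 n + \beta_1 \sum x_i = \sum y_i$
$\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i$

Solve the equations and we get
$\hat\beta_0 = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)}{n \sum x_i^2 - (\sum x_i)^2}$
$\hat\beta_1 = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}$

To simplify, we introduce
$S_{xy} = \sum (x_i - \bar x)(y_i - \bar y) = \sum x_i y_i - \frac{1}{n}(\sum x_i)(\sum y_i)$
$S_{xx} = \sum (x_i - \bar x)^2 = \sum x_i^2 - \frac{1}{n}(\sum x_i)^2$
$S_{yy} = \sum (y_i - \bar y)^2 = \sum y_i^2 - \frac{1}{n}(\sum y_i)^2$
so that $\hat\beta_1 = S_{xy} / S_{xx}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$.
The resulting equation $\hat y = \hat\beta_0 + \hat\beta_1 x$ is known as the least squares line, which is an estimate of the true regression line.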

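Example 10.2 (Tire Tread vs. Mileage: LS Line Fit)
Find the equation of the LS line for the tire tread wear data from Table 10.1. We have
$\sum x_i = 144$, $\sum y_i = 2197.32$, $\sum x_i^2 = 3264$, $\sum y_i^2 = 589,887.08$, $\sum x_i y_i = 28,167.72$
and n = 9. From these we calculate $\bar x = 16$, $\bar y = 244.15$,
$S_{xy} = 28,167.72 - \frac{1}{9}(144)(2197.32) = -6989.40$
$S_{xx} = 3264 - \frac{1}{9}(144)^2 = 960$

The slope and intercept estimates are
$\hat\beta_1 = -6989.40 / 960 = -7.281$ and $\hat\beta_0 = 244.15 + 7.281 \times 16 = 360.64$
Therefore, the equation of the LS line is $\hat y = 360.64 - 7.281 x$, with x measured in thousands of miles.
Conclusion: there is a loss of 7.281 mils in the tire groove depth for every 1000 miles of driving.
Given a particular $x^* = 25$, we can find $\hat y^* = 360.64 - 7.281 \times 25 = 178.62$, which means the mean groove depth for all tires driven for 25,000 miles is estimated to be 178.62 mils.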
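The LS calculation in Example 10.2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the helper name `ls_fit` is hypothetical, and the data are the nine (mileage, groove depth) pairs from the residual table later in these slides.

```python
# Tire tread wear data: x = mileage (1000 miles), y = groove depth (mils).
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]


def ls_fit(x, y):
    """Return (b0, b1): least squares intercept and slope (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope = Sxy / Sxx
    b0 = y_bar - b1 * x_bar    # intercept = y_bar - slope * x_bar
    return b0, b1


b0, b1 = ls_fit(x, y)
y_hat_25 = b0 + b1 * 25  # estimated mean groove depth at 25,000 miles
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}, y_hat(25) = {y_hat_25:.2f}")
```

The printed values should agree with the slide's 360.64, -7.281, and 178.62 up to rounding.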

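2.2 Goodness of Fit of the LS Line
Coefficient of Determination and Correlation
The residuals $e_i = y_i - \hat y_i$, $i = 1, \ldots, n$, are used to evaluate the goodness of fit of the LS line.

We define:
$SST = \sum (y_i - \bar y)^2$: total sum of squares
$SSR = \sum (\hat y_i - \bar y)^2$: regression sum of squares
$SSE = \sum (y_i - \hat y_i)^2$: error sum of squares
Note: SST = SSR + SSE.
$r^2 = SSR / SST = 1 - SSE / SST$ is called the coefficient of determination.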

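Example 10.3 (Tire Tread Wear vs. Mileage: Coefficient of Determination and Correlation)
For the tire tread wear data, calculate $r^2$ and r using the results from Example 10.2.
We have $SST = S_{yy} = 589,887.08 - \frac{1}{9}(2197.32)^2 = 53,418.73$.
Next calculate $SSR = \hat\beta_1^2 S_{xx} = S_{xy}^2 / S_{xx} = (-6989.40)^2 / 960 = 50,887.20$.
Therefore $SSE = 53,418.73 - 50,887.20 = 2531.53$ and $r^2 = 50,887.20 / 53,418.73 = 0.953$.
The Pearson correlation is $r = -\sqrt{0.953} = -0.976$, where the sign of r follows from the sign of $\hat\beta_1 = -7.281$.
Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.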
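As a check on Example 10.3, the sums of squares and the correlation can be computed directly from the data. This is a sketch under the same assumptions as before (tire data taken from the residual table in these slides; variable names are illustrative).

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

sst = s_yy                 # total sum of squares
ssr = s_xy ** 2 / s_xx     # regression sum of squares (= b1^2 * Sxx)
sse = sst - ssr            # error sum of squares
r2 = ssr / sst             # coefficient of determination
r = math.copysign(math.sqrt(r2), s_xy)  # r takes the sign of the slope

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"r^2 = {r2:.3f}, r = {r:.3f}")
```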

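The Maximum Likelihood Estimators (MLE)
Consider the linear model $Y_i = a + b x_i + \epsilon_i$ (here $a = \beta_0$ and $b = \beta_1$), where $\epsilon_i$ is drawn from a normal population with mean 0 and standard deviation σ. The likelihood function for Y is:
$L = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y_i - a - b x_i)^2}{2\sigma^2}\right]$
Thus, the log-likelihood for the data is:
$\ln L = -\frac{n}{2}\ln(2\pi) - n \ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - a - b x_i)^2$

The MLE Estimators
Solving $\partial \ln L / \partial a = 0$, $\partial \ln L / \partial b = 0$, and $\partial \ln L / \partial \sigma = 0$, we obtain the MLEs of the three unknown model parameters:
$\hat b = S_{xy} / S_{xx}$, $\hat a = \bar y - \hat b \bar x$, $\hat\sigma^2 = \frac{1}{n}\sum (y_i - \hat a - \hat b x_i)^2 = SSE / n$
The MLEs of the model parameters a and b are the same as the LSEs; both are unbiased.
The MLE of the error variance, however, is biased: $E(\hat\sigma^2) = \frac{n-2}{n}\sigma^2$.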

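2.3 An Unbiased Estimator of σ²
An unbiased estimate of $\sigma^2$ is given by $s^2 = \frac{\sum e_i^2}{n-2} = \frac{SSE}{n-2}$.
Example 10.4 (Tire Tread Wear vs. Mileage: Estimate of σ²)
Find the estimate of $\sigma^2$ for the tread wear data using the results from Example 10.3.
We have SSE = 2531.53 and n - 2 = 7; therefore $s^2 = 2531.53 / 7 = 361.65$, which has 7 d.f. The estimate of σ is $s = \sqrt{361.65} = 19.02$ mils.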
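To make the difference between the two variance estimators concrete, the sketch below computes both SSE/n (the MLE) and SSE/(n-2) (the unbiased estimator) for the tire data. The data come from the residual table in these slides; the variable names are illustrative only.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Error sum of squares from the residuals of the fitted line.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

sigma2_mle = sse / n       # MLE: divides by n, biased downward
s2 = sse / (n - 2)         # unbiased: divides by the error d.f.
s = math.sqrt(s2)

print(f"SSE = {sse:.2f}, MLE = {sigma2_mle:.2f}, s^2 = {s2:.2f}, s = {s:.2f}")
```

With only n = 9 observations the two estimates differ noticeably, since the MLE is smaller by the factor (n-2)/n = 7/9.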

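3. Statistical Inference on β0 and β1
Under the normal error assumption:
* Point estimators: $\hat\beta_0$ and $\hat\beta_1$ (the LS estimators)
* Sampling distributions of $\hat\beta_0$ and $\hat\beta_1$:
$\hat\beta_0 \sim N\left(\beta_0, \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right)$, with $SE(\hat\beta_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}}$
$\hat\beta_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right)$, with $SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}$
For mathematical derivations, please refer to the Tamhane and Dunlop textbook, p. 331.

Statistical Inference on β0 and β1, Con’t
* Pivotal Quantities (P.Q.’s): $\frac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2}$ and $\frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$
* Confidence Intervals (CI’s): $\hat\beta_0 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_0)$ and $\hat\beta_1 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_1)$

Statistical Inference on β0 and β1, Con’t
* Hypothesis tests: $H_0: \beta_1 = \beta_1^0$ vs. $H_a: \beta_1 \neq \beta_1^0$, and $H_0: \beta_0 = \beta_0^0$ vs. $H_a: \beta_0 \neq \beta_0^0$
-- Test statistics: $t_0 = \frac{\hat\beta_1 - \beta_1^0}{SE(\hat\beta_1)}$ and $t_0 = \frac{\hat\beta_0 - \beta_0^0}{SE(\hat\beta_0)}$
-- At the significance level α, we reject $H_0$ in favor of $H_a$ if and only if (iff) $|t_0| > t_{n-2,\alpha/2}$
-- The first test, with $\beta_1^0 = 0$, is used to show whether there is a linear relationship between x and y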
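A short sketch of the slope test and confidence interval for the tire data follows. It assumes the standard result SE(slope) = s/sqrt(Sxx), and it hard-codes the critical value 2.365 for t with 7 d.f. at α/2 = 0.025, taken from a t table rather than computed; both are assumptions of this illustration, as are the variable names.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]
T_CRIT = 2.365  # table value of t_{7, 0.025}; hard-coded assumption

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

se_b1 = s / math.sqrt(s_xx)    # standard error of the slope
t0 = (b1 - 0.0) / se_b1        # test statistic for H0: slope = 0
reject_h0 = abs(t0) > T_CRIT   # two-sided test at alpha = 0.05
ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)  # 95% CI for the slope

print(f"SE(b1) = {se_b1:.4f}, t0 = {t0:.2f}, reject H0: {reject_h0}")
print(f"95% CI for slope: [{ci_b1[0]:.3f}, {ci_b1[1]:.3f}]")
```

Because the interval excludes zero, the two-sided test and the CI lead to the same conclusion about the slope.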

Analysis of Variance (ANOVA)
Mean Square: a sum of squares divided by its d.f.
MSR = SSR / 1, MSE = SSE / (n - 2), F = MSR / MSE

ANOVA Table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square          F
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR / 1        F = MSR / MSE
Error                 SSE              n - 2                MSE = SSE / (n - 2)
Total                 SST              n - 1

Example:
Source       SS          d.f.   MS          F
Regression   50,887.20   1      50,887.20   140.71
Error        2531.53     7      361.65
Total        53,418.73   8
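The ANOVA quantities for the tire example can be reproduced with the sketch below. It is an illustration only (data from the residual table in these slides, illustrative variable names), and it simply prints the table rows rather than looking up an F critical value.

```python
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = s_xy ** 2 / s_xx
sse = sst - ssr

msr = ssr / 1          # regression mean square (1 d.f.)
mse = sse / (n - 2)    # error mean square (n - 2 d.f.)
f_stat = msr / mse     # F statistic with (1, n - 2) d.f.

print(f"{'Source':<12}{'SS':>12}{'d.f.':>6}{'MS':>12}{'F':>9}")
print(f"{'Regression':<12}{ssr:>12.2f}{1:>6}{msr:>12.2f}{f_stat:>9.2f}")
print(f"{'Error':<12}{sse:>12.2f}{n - 2:>6}{mse:>12.2f}")
print(f"{'Total':<12}{sst:>12.2f}{n - 1:>6}")
```

In simple linear regression the F statistic equals the square of the t statistic for the slope, so both tests of a zero slope agree.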

4. Regression Diagnostics
4.1 Checking for Model Assumptions
● Checking for Linearity
● Checking for Constant Variance
● Checking for Normality
● Checking for Independence

Checking for Linearity
$x_i$ = mileage, $y_i$ = groove depth, $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ = fitted value, $e_i = y_i - \hat y_i$ = residual

i   x_i   y_i      fitted y_i   e_i
1   0     394.33   360.64       33.69
2   4     329.50   331.51       -2.01
3   8     291.00   302.39       -11.39
4   12    255.17   273.27       -18.10
5   16    229.33   244.15       -14.82
6   20    204.83   215.02       -10.19
7   24    179.00   185.90       -6.90
8   28    163.83   156.78       7.05
9   32    150.33   127.66       22.67

Checking for Normality

Checking for Constant Variance
[Figure: residual plot when Var(Y) is not constant]
[Figure: sample residual plot when Var(Y) is constant]

Checking for Independence
This check does not apply to the simple linear regression model in general; it applies only to time series data.

4.2 Checking for Outliers & Influential Observations
● What is an outlier
● Why checking for outliers is important
● Mathematical definition
● How to deal with them

4.2-A. Intro
Recall the Box and Whiskers Plot (Chapter 4 of T&D):
● A (mild) OUTLIER is defined as any observation that lies outside the interval from Q1 - (1.5*IQR) to Q3 + (1.5*IQR), where the interquartile range is IQR = Q3 − Q1.
● An (extreme) OUTLIER is one that lies outside the interval from Q1 - (3*IQR) to Q3 + (3*IQR).
● Informally, an outlier is an observation "far away" from the rest of the data.

4.2-B. Why are outliers a problem?
● They may indicate a sample peculiarity, a data entry error, or another problem.
● Regression coefficients estimated by minimizing the Sum of Squares for Error (SSE) are very sensitive to outliers >> bias or distortion of estimates.
● Any statistical test based on sample means and variances can be distorted in the presence of outliers >> distortion of p-values; faulty conclusions.
Example: (Estimators not sensitive to outliers are said to be robust.)

                  Sorted Data     Median   Mean   Variance   95% CI for mean
Real Data         1 3 5 9 12      5        6.0    20.0       [0.45, 11.55]
Data with Error   1 3 5 9 120     5        27.6   2676.8     [-36.63, 91.83]
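The robustness example can be recomputed with Python's standard `statistics` module. This is a sketch: the function name `summarize` is hypothetical, and the critical value 2.776 for t with 4 d.f. at α/2 = 0.025 is a table value hard-coded here as an assumption.

```python
import math
import statistics

T_CRIT_4DF = 2.776  # table value of t_{4, 0.025}; hard-coded assumption


def summarize(data):
    """Return (median, mean, sample variance, 95% CI for the mean)."""
    n = len(data)
    mean = statistics.mean(data)
    var = statistics.variance(data)  # sample variance, divides by n - 1
    half_width = T_CRIT_4DF * math.sqrt(var / n)
    ci = (mean - half_width, mean + half_width)
    return statistics.median(data), mean, var, ci


real = summarize([1, 3, 5, 9, 12])
with_error = summarize([1, 3, 5, 9, 120])

for label, (med, mean, var, ci) in (("real", real), ("error", with_error)):
    print(f"{label}: median={med}, mean={mean:.1f}, var={var:.1f}, "
          f"CI=[{ci[0]:.2f}, {ci[1]:.2f}]")
```

The median is unchanged by the single bad value, while the mean, variance, and confidence interval all move substantially, which is the sense in which the median is robust.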

4.2-C. Mathematical Definition: Outlier
The standardized residual is given by
$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar x)^2}{S_{xx}}}}$
If $|e_i^*| > 2$, then the corresponding observation may be regarded as an outlier.
Example: (Tire Tread Wear vs. Mileage)

i      1      2       3       4       5       6       7       8      9
e_i*   2.25   -0.12   -0.66   -1.02   -0.83   -0.57   -0.40   0.43   1.51

STUDENTIZED RESIDUAL: a type of standardized residual calculated with the current observation deleted from the analysis.
The LS fit can be excessively influenced by an observation that is not necessarily an outlier as defined above.

4.2-C. Mathematical Definition: Influential Observation
An influential observation is an observation with an extreme x-value, y-value, or both.
● The leverage of observation i is $h_{ii}$. On average $h_{ii}$ is (k+1)/n, where k is the number of predictors; regard any $h_{ii} > 2(k+1)/n$ as high leverage.
● If $x_i$ deviates greatly from the mean $\bar x$, then $h_{ii}$ is large.
● The standardized residual will be large for a high leverage observation.
● Influence can be thought of as the product of leverage and outlierness.
Example: (Observation is influential / high leverage, but not an outlier)
[Figures: e.g. 1, fits with and without the observation; e.g. 2, scatter plot and residual plot]
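The outlier and leverage rules above can be applied to the tire data as follows. This sketch assumes the standard simple-regression leverage formula h_ii = 1/n + (x_i - x_bar)^2 / Sxx, which is not written out on the slide, and uses illustrative variable names.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
k = 1  # number of predictors in simple linear regression
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e * e for e in resid) / (n - 2))

# Leverage for simple linear regression (assumed standard formula).
leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
# Standardized residual: e_i / (s * sqrt(1 - h_ii)).
std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]

outliers = [i + 1 for i, e in enumerate(std_resid) if abs(e) > 2]
high_leverage = [i + 1 for i, h in enumerate(leverage) if h > 2 * (k + 1) / n]

print("standardized residuals:", [round(e, 2) for e in std_resid])
print("outliers (1-based):", outliers)
print("high leverage (1-based):", high_leverage)
```

For these data only the first observation exceeds the |e*| > 2 rule, and no observation exceeds the leverage cutoff 2(k+1)/n = 4/9.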

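4.2-C. SAS code of the tire example
    /* Read the tire data: x = mileage, y = groove depth */
    data tire;
      input x y;
      datalines;
    0 394.33
    4 329.50
    …
    32 150.33
    ;
    run;
    /* Fit the LS line and save diagnostics for each observation */
    proc reg data=tire;
      model y=x;
      output out=resid rstudent=r h=lev cookd=cd dffits=dffit;
    run;
    /* Print only observations flagged by at least one diagnostic */
    proc print data=resid;
      where abs(r)>=2 or lev>(4/9) or cd>(4/9) or abs(dffit)>(2*sqrt(1/9));
    run;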

4.2-C. SAS output of the tire example

4.2-D. How to deal with Outliers & Influential Observations
● Investigate (Data errors? Rare events? Can they be corrected?)
● Ways to accommodate outliers:
  - Nonparametric methods (robust to outliers)
  - Data transformations
  - Deletion (or report model results both with and without the outliers or influential observations to see how much they change)

4.3 Data Transformations
Reasons:
● To achieve linearity
● To achieve homogeneity of variance
● To achieve normality or symmetry about the regression equation

Types of Transformation
● Linearizing Transformation: a transformation of the response variable, the predictor variable, or both, which produces an approximately linear relationship between the variables.
● Variance Stabilizing Transformation: a transformation made when the constant variance assumption is violated.

Linearizing Transformation
● Use a mathematical operation, e.g. square root, power, log, exponential, etc.
● Only one variable needs to be transformed in simple linear regression. Which one? Predictor or response? Why?

e.g. We take a log transformation on $Y = a \exp(-bx)$ <=> $\log Y = \log a - bx$

x_i   y_i      fitted log y_i   fitted y_i = exp(fitted log y_i)   e_i
0     394.33   5.926            374.64                             19.69
4     329.50   5.807            332.58                             -3.08
8     291.00   5.688            295.24                             -4.24
12    255.17   5.569            262.09                             -6.92
16    229.33   5.450            232.67                             -3.34
20    204.83   5.331            206.54                             -1.71
24    179.00   5.211            183.36                             -4.36
28    163.83   5.092            162.77                             1.06
32    150.33   4.973            144.50                             5.83
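The log-linear fit in this table can be sketched by regressing log y on x and transforming the fitted values back. This is an illustration that assumes natural logarithms (consistent with the tabulated fitted values); the variable names are hypothetical.

```python
import math

x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

log_y = [math.log(yi) for yi in y]  # natural log of the response

n = len(x)
x_bar = sum(x) / n
ly_bar = sum(log_y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xly = sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, log_y))

slope = s_xly / s_xx                 # estimate of -b in log Y = log a - b x
intercept = ly_bar - slope * x_bar   # estimate of log a

fitted_log = [intercept + slope * xi for xi in x]
fitted_y = [math.exp(v) for v in fitted_log]      # back-transform to the y scale
resid = [yi - fi for yi, fi in zip(y, fitted_y)]  # residuals on the y scale

for xi, yi, fl, fy, e in zip(x, y, fitted_log, fitted_y, resid):
    print(f"{xi:>3} {yi:>8.2f} {fl:>7.3f} {fy:>8.2f} {e:>7.2f}")
```

The residuals on the original scale are much smaller than those of the straight-line fit, especially at the two ends of the mileage range.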

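Variance Stabilizing Transformation
Delta method: two-term Taylor series approximation
$Var(h(Y)) \approx [h'(\mu)]^2 g^2(\mu)$, where $Var(Y) = g^2(\mu)$ and $E(Y) = \mu$.
Set $[h'(\mu)]^2 g^2(\mu) = 1$:
$h'(\mu) = \frac{1}{g(\mu)}$, so $h(\mu) = \int \frac{d\mu}{g(\mu)}$, i.e. $h(y) = \int \frac{dy}{g(y)}$.
e.g. $Var(Y) = c^2\mu^2$, where c > 0: $g(\mu) = c\mu$ ↔ $g(y) = cy$, and
$h(y) = \int \frac{dy}{cy} = \frac{1}{c}\int \frac{dy}{y} = \frac{1}{c}\log y$.
Therefore it is the logarithmic transformation.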

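5. Correlation Analysis
Pearson Product Moment Correlation: a measurement of how closely two variables share a linear relationship.
Useful when it is not possible to determine which variable is the predictor and which is the response.
Example: health vs. wealth. Which is the predictor? Which is the response?

Statistical Inference on the Correlation Coefficient ρ
We can derive a test on the correlation coefficient in the same way that we have been doing in class.
● Assumptions: X, Y are from the bivariate normal distribution.
● Start with the point estimator r, the sample correlation coefficient, which estimates the population correlation coefficient ρ: $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
● Get the pivotal quantity. The distribution of r is quite complicated, so we use $T_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, the test statistic for ρ = 0.
● Do we know everything about the p.q.? Yes: $T_0 \sim t_{n-2}$ under $H_0: \rho = 0$.

Bivariate Normal Distribution
pdf: $f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2\right]\right\}$
Properties:
● $\mu_1, \mu_2$: means for X, Y
● $\sigma_1^2, \sigma_2^2$: variances for X, Y
● ρ: the correlation coefficient between X and Y

Derivation of T0
Using $\hat\beta_1 = r\sqrt{S_{yy}/S_{xx}}$ and $s^2 = \frac{SSE}{n-2} = \frac{(1-r^2)S_{yy}}{n-2}$, the t statistic for the slope is
$t = \frac{\hat\beta_1}{SE(\hat\beta_1)} = \frac{\hat\beta_1\sqrt{S_{xx}}}{s} = \frac{r\sqrt{S_{yy}}}{\sqrt{(1-r^2)S_{yy}/(n-2)}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
Therefore, we can use t as a statistic for testing the null hypothesis $H_0: \beta_1 = 0$.
Equivalently, we can test $H_0: \rho = 0$.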

Exact Statistical Inference on ρ
Test $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$
Test statistic: $t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
Reject $H_0$ iff $|t_0| > t_{n-2,\alpha/2}$
Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?
$H_0: \rho = 0$, $H_a: \rho \neq 0$. For α = .01, $t_0 = \frac{0.7\sqrt{13}}{\sqrt{1 - 0.49}} = 3.534 > t_{13,.005} = 3.012$, so we reject $H_0$.
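The test statistic in this example is easy to verify numerically. The sketch below uses the slide's values (r = 0.7, n = 15) and the slide's critical value 3.012; the function name `corr_t_stat` is hypothetical.

```python
import math


def corr_t_stat(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r, n = 0.7, 15
t_crit = 3.012  # t_{13, 0.005}, as quoted on the slide

t0 = corr_t_stat(r, n)
reject_h0 = abs(t0) > t_crit

print(f"t0 = {t0:.3f}, reject H0 at alpha = 0.01: {reject_h0}")
```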

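Approximate Statistical Inference on ρ
● There is no exact method of testing ρ against an arbitrary $\rho_0$:
  - The distribution of R is very complicated.
  - $T_0 \sim t_{n-2}$ only when ρ = 0.
● To test ρ against an arbitrary $\rho_0$, one can use Fisher’s transformation:
$\tanh^{-1} R = \frac{1}{2}\ln\left(\frac{1+R}{1-R}\right) \approx N\left(\frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right), \frac{1}{n-3}\right)$
● Therefore, let $\psi = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)$. Testing $H_0: \rho = \rho_0$ vs. $H_a: \rho \neq \rho_0$ is equivalent to testing $H_0: \psi = \psi_0$ vs. $H_a: \psi \neq \psi_0$, where $\psi_0 = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$.

Approximate Statistical Inference on ρ
● Sample estimate: $\hat\psi = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$
● Z test statistic: $z_0 = \sqrt{n-3}\,(\hat\psi - \psi_0)$
● We reject $H_0$ if $|z_0| > z_{\alpha/2}$
● CI for ρ: first form the CI for ψ, $[l, u] = \hat\psi \pm \frac{z_{\alpha/2}}{\sqrt{n-3}}$, then transform back: $\frac{e^{2l}-1}{e^{2l}+1} \le \rho \le \frac{e^{2u}-1}{e^{2u}+1}$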
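Fisher's transformation is the inverse hyperbolic tangent, so the approximate test and interval can be sketched with `math.atanh` and `math.tanh`. The inputs below are assumptions for illustration: r = 0.7 and n = 15 are borrowed from the earlier example, the null value 0.5 is hypothetical, and 1.96 is the usual table value of z at α/2 = 0.025.

```python
import math

r, n = 0.7, 15       # sample correlation and sample size (from the earlier example)
rho_0 = 0.5          # hypothetical null value, chosen only for illustration
z_crit = 1.96        # z_{0.025}; table value, hard-coded assumption

psi_hat = math.atanh(r)    # (1/2) ln((1 + r) / (1 - r))
psi_0 = math.atanh(rho_0)

z0 = math.sqrt(n - 3) * (psi_hat - psi_0)   # approximate N(0, 1) under H0
reject_h0 = abs(z0) > z_crit

# CI for psi, then back-transform with tanh to get a CI for rho.
half_width = z_crit / math.sqrt(n - 3)
lower = math.tanh(psi_hat - half_width)
upper = math.tanh(psi_hat + half_width)

print(f"z0 = {z0:.2f}, reject H0: {reject_h0}")
print(f"95% CI for rho: [{lower:.3f}, {upper:.3f}]")
```

Note that tanh(u) equals (e^{2u} - 1)/(e^{2u} + 1), so the back-transformation matches the CI formula for ρ.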

Approximate Statistical Inference on ρ using SAS Code: Output:

Pitfalls of Regression and Correlation Analysis
● Correlation and causation: ticks cause good health
● Coincidental data: sun spots and Republicans
● Lurking variables: church, suicide, population
● Restricted range: local vs. global linearity

Summary
Linear regression analysis:
● Probabilistic model for linear regression
● Model assumptions: linearity, constant variance, normality, independence
● Least squares (LS) fit: the LS estimates of β0 and β1
● Statistical inference on β0 & β1
● Confidence interval & prediction interval
● Outliers? Influential observations? Data transformations?
Correlation analysis:
● Sample correlation coefficient r

Questions?