# Simple Linear Regression & Correlation Instructor: Prof. Wei Zhu 11/21/2013 AMS 572 Group Project.


Outline

1. Motivation & Introduction – Lizhou Nie
2. A Probabilistic Model for Simple Linear Regression – Long Wang
3. Fitting the Simple Linear Regression Model – Zexi Han
4. Statistical Inference for Simple Linear Regression – Lichao Su
5. Regression Diagnostics – Jue Huang
6. Correlation Analysis – Ting Sun
7. Implementation in SAS – Qianyi Chen
8. Application and Summary – Jie Shuai

1. Motivation

Fig. 1.1 Simplified Model for Solar System (http://popperfont.net/2012/11/13/the-ultimate-solar-system-animated-gif/)

Fig. 1.2 Obama & Romney during Presidential Election Campaign (http://outfront.blogs.cnn.com/2012/08/14/the-most-negative-in-campaign-history/)

Introduction

Regression Analysis
- Linear Regression:
  - Simple Linear Regression: $\{y; x\}$
  - Multiple Linear Regression: $\{y; x_1, \dots, x_p\}$
  - Multivariate Linear Regression: $\{y_1, \dots, y_n; x_1, \dots, x_p\}$

Correlation Analysis
- Pearson Product-Moment Correlation Coefficient: a measure of the linear relationship between two variables

History

- Adrien-Marie Legendre: earliest form of regression, the least squares method
- Carl Friedrich Gauss: further development of least squares theory, including the Gauss-Markov theorem
- Sir Francis Galton: coined the term "regression"
- George Udny Yule & Karl Pearson: extension to a more generalized statistical context

Sources:
- http://en.wikipedia.org/wiki/Regression_analysis
- http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
- http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
- http://en.wikipedia.org/wiki/Francis_Galton
- http://www.york.ac.uk/depts/maths/histstat/people/yule.gif
- http://en.wikipedia.org/wiki/Karl_Pearson

2. A Probabilistic Model

Simple Linear Regression
- Special case of linear regression
- One response variable to one explanatory variable

General Setting
- We denote the explanatory variable by $X_i$ and the response variable by $Y_i$
- $n$ pairs of observations $\{x_i, y_i\}$, $i = 1, \dots, n$

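2. A Probabilistic Model

Sketch the graph (point labeled on the plot: (29, 5.5)).

| i | X | Y |
|---|---|---|
| 1 | 37.70 | 9.82 |
| 2 | 16.31 | 5.00 |
| 3 | 28.37 | 9.27 |
| 4 | -12.13 | 2.98 |
| … | … | … |
| 98 | 9.06 | 7.34 |
| 99 | 28.54 | 10.37 |
| 100 | -17.19 | 2.33 |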

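2. A Probabilistic Model

In simple linear regression, the data are described as

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \dots, n,$$

where $\epsilon_i \sim N(0, \sigma^2)$.

The fitted model:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,$$

where $\hat{\beta}_0$ is the intercept and $\hat{\beta}_1$ is the slope of the regression line.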
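To make the probabilistic model concrete, here is a minimal simulation sketch. The function name `simulate_slr` and the parameter values are illustrative assumptions, not taken from the slides; the sketch only draws responses from a straight line plus independent normal errors.

```python
import random

def simulate_slr(beta0, beta1, sigma, xs, seed=0):
    """Draw one response per x from Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values, chosen only for illustration
xs = [0.5 * i for i in range(100)]
ys = simulate_slr(beta0=2.0, beta1=0.3, sigma=1.5, xs=xs)
print(list(zip(xs, ys))[:3])
```

With `sigma=0` the simulated points fall exactly on the line, which is a quick way to check the sketch.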

3. Fitting the Simple Linear Regression Model

Table 3.1. Tire tread wear vs. mileage. From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.

| Mileage (in 1000 miles) | Groove Depth (in mils) |
|---|---|
| 0 | 394.33 |
| 4 | 329.50 |
| 8 | 291.00 |
| 12 | 255.17 |
| 16 | 229.33 |
| 20 | 204.83 |
| 24 | 179.00 |
| 28 | 163.83 |
| 32 | 150.33 |

Fig 3.1. Scatter plot of tire tread wear vs. mileage.

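3. Fitting the Simple Linear Regression Model

The difference between the fitted line and the real data is

$$e_i = y_i - (\beta_0 + \beta_1 x_i).$$

Our goal: minimize the sum of squares

$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2.$$

Fig 3.2. $e_i$ is the vertical distance between the fitted line and the real data.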

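3. Fitting the Simple Linear Regression Model

Least Squares Method

Let $Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$ be the sum of squared differences. Setting its partial derivatives to zero gives the normal equations:

$$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)] = 0, \qquad \frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)] = 0.$$

To simplify, we denote:

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$

Solving the normal equations then gives

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$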

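3. Fitting the Simple Linear Regression Model

Back to the example: with $n = 9$, $\bar{x} = 16$, $\bar{y} = 244.147$, $S_{xx} = \sum (x_i - \bar{x})^2 = 960$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = -6989.40$,

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-6989.40}{960} = -7.2806 \approx -7.281, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 244.147 + 7.2806 \times 16 = 360.64.$$

Therefore, the equation of the fitted line is:

$$\hat{y} = 360.64 - 7.281x.$$

Not enough! We still need to check how well the line fits the data.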
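The least squares computation for the tire example can be sketched in a few lines of plain Python. The helper name `least_squares` is an assumption made for illustration; the data are those of Table 3.1.

```python
# Tire tread wear data from Table 3.1
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]  # mileage (1000 miles)
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]  # groove depth (mils)

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

b0, b1 = least_squares(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.3f}")  # intercept = 360.64, slope = -7.281
```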

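3. Fitting the Simple Linear Regression Model

Check the goodness of fit of the LS line

We define:
- SST, the total sum of squares: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
- SSR, the regression sum of squares: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
- SSE, the error sum of squares: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Prove: $SST = SSR + SSE$.

The ratio

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

is called the coefficient of determination.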

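3. Fitting the Simple Linear Regression Model

Check the goodness of fit of the LS line

Back to the example:

$$SST = 53418.73, \quad SSR = 50887.20, \quad r^2 = \frac{SSR}{SST} = \frac{50887.20}{53418.73} = 0.953, \quad r = -\sqrt{0.953} = -0.976,$$

where the sign of $r$ follows from the sign of $\hat{\beta}_1 = -7.281$. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.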
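As a check on the goodness-of-fit numbers, the following sketch recomputes the three sums of squares and the coefficient of determination for the tire data. Variable names are illustrative assumptions.

```python
import math

# Tire tread wear data from Table 3.1
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)          # regression sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # error sum of squares

r2 = ssr / sst
r = math.copysign(math.sqrt(r2), b1)  # r takes the sign of the slope
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"r^2 = {r2:.3f}, r = {r:.3f}")
```

The identity SST = SSR + SSE holds up to floating-point rounding, which is a useful sanity check on any implementation.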

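3. Fitting the Simple Linear Regression Model

$r$ is the sample correlation coefficient between X and Y:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}.$$

For the simple linear regression,

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = r \sqrt{\frac{S_{yy}}{S_{xx}}} = r \, \frac{s_y}{s_x}.$$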

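3. Fitting the Simple Linear Regression Model

Estimation of $\sigma^2$

The variance $\sigma^2$ measures the scatter of the $Y_i$ around their means $\mu_i = \beta_0 + \beta_1 x_i$. An unbiased estimate of $\sigma^2$ is given by

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2}.$$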

3. Fitting the Simple Linear Regression Model

From the example, we have $SSE = 2531.53$ and $n - 2 = 7$, therefore

$$s^2 = \frac{2531.53}{7} = 361.65,$$

which has 7 d.f. The estimate of $\sigma$ is $s = \sqrt{361.65} = 19.02$.

4. Statistical Inference For SLR

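Under the normal error assumption:

- Point estimators:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

- Sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$\hat{\beta}_0 \sim N\!\left(\beta_0, \; \frac{\sigma^2 \sum x_i^2}{n S_{xx}}\right), \qquad \hat{\beta}_1 \sim N\!\left(\beta_1, \; \frac{\sigma^2}{S_{xx}}\right)$$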

Derivation

For mathematical derivations, please refer to the Tamhane and Dunlop textbook, p. 331.

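Statistical Inference on $\beta_0$ and $\beta_1$

With $SE(\hat{\beta}_0) = s \sqrt{\sum x_i^2 / (n S_{xx})}$ and $SE(\hat{\beta}_1) = s / \sqrt{S_{xx}}$:

- Pivotal Quantities (P.Q.'s):

$$\frac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} \sim t_{n-2}, \qquad \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$$

- Confidence Intervals (C.I.'s), at level $100(1-\alpha)\%$:

$$\hat{\beta}_0 \pm t_{n-2,\alpha/2} \, SE(\hat{\beta}_0), \qquad \hat{\beta}_1 \pm t_{n-2,\alpha/2} \, SE(\hat{\beta}_1)$$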

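Hypothesis tests:

$$H_0: \beta_1 = \beta_1^0 \quad \text{vs.} \quad H_a: \beta_1 \neq \beta_1^0.$$

Reject $H_0$ at level $\alpha$ if

$$|t_0| = \frac{|\hat{\beta}_1 - \beta_1^0|}{SE(\hat{\beta}_1)} > t_{n-2,\alpha/2}.$$

A useful application is to show whether there is a linear relationship between x and y:

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0.$$

Reject $H_0$ at level $\alpha$ if

$$|t_0| = \frac{|\hat{\beta}_1|}{SE(\hat{\beta}_1)} > t_{n-2,\alpha/2}.$$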
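The slope test and confidence interval can be sketched for the tire data as follows. The critical value `T_CRIT = 2.365` is the tabulated $t_{7, 0.025}$ and is hard-coded as an assumption, since the Python standard library has no t quantile function; other names are illustrative.

```python
import math

# Tire tread wear data from Table 3.1
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))   # estimate of sigma, n - 2 d.f.
se_b1 = s / math.sqrt(s_xx)    # standard error of the slope

T_CRIT = 2.365                 # t_{7, 0.025} taken from a t table (assumption)
t0 = b1 / se_b1                # test statistic for H0: beta1 = 0
reject = abs(t0) > T_CRIT
ci = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)  # 95% C.I. for beta1

print(f"s = {s:.2f}, SE(b1) = {se_b1:.4f}, t0 = {t0:.2f}, reject H0: {reject}")
print(f"95% C.I. for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
```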

Analysis of Variance (ANOVA)

Mean Square: a sum of squares divided by its degrees of freedom.

$$MSR = \frac{SSR}{1}, \qquad MSE = \frac{SSE}{n - 2}, \qquad F = \frac{MSR}{MSE}$$

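ANOVA Table

| Source of Variation (Source) | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS) | F |
|---|---|---|---|---|
| Regression | SSR | 1 | MSR = SSR/1 | F = MSR/MSE |
| Error | SSE | n - 2 | MSE = SSE/(n - 2) | |
| Total | SST | n - 1 | | |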
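A minimal sketch of how the ANOVA table entries could be computed for the tire data is given below. The function name `anova_slr` and the dictionary layout are assumptions made for illustration.

```python
# Tire tread wear data from Table 3.1
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

def anova_slr(x, y):
    """Return the ANOVA table entries for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    msr = ssr / 1          # regression mean square, 1 d.f.
    mse = sse / (n - 2)    # error mean square, n - 2 d.f.
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "MSR": msr, "MSE": mse, "F": msr / mse}

table = anova_slr(x, y)
for name, value in table.items():
    print(f"{name:>3} = {value:.2f}")
```

In simple linear regression the F statistic equals the square of the t statistic for testing a zero slope, so the two tests agree.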

5. Regression Diagnostics

5.1 Checking the Model Assumptions (primary tool: residual plots)
- 5.1.1 Checking for Linearity
- 5.1.2 Checking for Constant Variance
- 5.1.3 Checking for Normality

5.2 Checking for Outliers and Influential Observations
- 5.2.1 Checking for Outliers
- 5.2.2 Checking for Influential Observations
- 5.2.3 How to Deal with Outliers and Influential Observations


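5. Regression Diagnostics

5.1.1 Checking for Linearity

| $i$ | $x_i$ | $y_i$ | $\hat{y}_i$ | $e_i$ |
|---|---|---|---|---|
| 1 | 0 | 394.33 | 360.64 | 33.69 |
| 2 | 4 | 329.50 | 331.51 | -2.01 |
| 3 | 8 | 291.00 | 302.39 | -11.39 |
| 4 | 12 | 255.17 | 273.27 | -18.10 |
| 5 | 16 | 229.33 | 244.15 | -14.82 |
| 6 | 20 | 204.83 | 215.02 | -10.19 |
| 7 | 24 | 179.00 | 185.90 | -6.90 |
| 8 | 28 | 163.83 | 156.78 | 7.05 |
| 9 | 32 | 150.33 | 127.66 | 22.67 |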

5. Regression Diagnostics

5.1.1 Checking for Linearity (Data transformation)

Figure 5.2 Typical Scatter Plot Shapes and Corresponding Linearizing Transformations. Transformations labeled in the figure: $x \to x^2$, $x^3$, $\log x$, $-1/x$ and $y \to y^2$, $y^3$, $\log y$, $-1/y$.

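5. Regression Diagnostics

5.1.1 Checking for Linearity (Data transformation)

| $i$ | $x_i$ | $y_i$ | $\widehat{\ln y_i}$ | $\hat{y}_i = \exp(\widehat{\ln y_i})$ | $e_i$ |
|---|---|---|---|---|---|
| 1 | 0 | 394.33 | 5.926 | 374.64 | 19.69 |
| 2 | 4 | 329.50 | 5.807 | 332.58 | -3.08 |
| 3 | 8 | 291.00 | 5.688 | 295.24 | -4.24 |
| 4 | 12 | 255.17 | 5.569 | 262.09 | -6.92 |
| 5 | 16 | 229.33 | 5.450 | 232.67 | -3.34 |
| 6 | 20 | 204.83 | 5.331 | 206.54 | -1.71 |
| 7 | 24 | 179.00 | 5.211 | 183.36 | -4.36 |
| 8 | 28 | 163.83 | 5.092 | 162.77 | 1.06 |
| 9 | 32 | 150.33 | 4.973 | 144.50 | 5.83 |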
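The transformed fit can be sketched by regressing the natural log of groove depth on mileage and back-transforming the fitted values. This is an illustrative reconstruction under that assumption; the variable names are not from the slides.

```python
import math

# Tire tread wear data from Table 3.1
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
ln_y = [math.log(yi) for yi in y]   # natural-log transform of the response
x_bar = sum(x) / n
ln_bar = sum(ln_y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xl = sum((xi - x_bar) * (li - ln_bar) for xi, li in zip(x, ln_y))

a1 = s_xl / s_xx           # slope on the log scale
a0 = ln_bar - a1 * x_bar   # intercept on the log scale

fitted = [math.exp(a0 + a1 * xi) for xi in x]   # back-transformed fitted values
resid = [yi - fi for yi, fi in zip(y, fitted)]  # residuals on the original scale

print(f"ln(y) = {a0:.3f} + ({a1:.4f}) x")
for i, (fi, ei) in enumerate(zip(fitted, resid), start=1):
    print(i, round(fi, 2), round(ei, 2))
```

Compared with the straight-line fit, the residuals on the original scale are much smaller, which is the point of the transformation.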


5.1.2 Checking for Constant Variance 5. Regression Diagnostics


5. Regression Diagnostics

5.1.3 Checking for Normality

Make a normal plot of the residuals. The residuals have zero mean and approximately constant variance (assuming the other assumptions about the model are correct).


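5. Regression Diagnostics

Outlier: an observation that does not follow the general pattern of the relationship between y and x. A large residual indicates an outlier. Standardized residuals are given by

$$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s \sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar{x})^2}{S_{xx}}}}, \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

If $|e_i^*| > 2$, then the corresponding observation may be regarded as an outlier.

Influential Observation: an influential observation has an extreme x-value, an extreme y-value, or both. If we express the fitted value of y as a linear combination of all the $y_j$,

$$\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j, \qquad h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}},$$

and $h_{ii} > 2(k+1)/n$, where $k$ is the number of predictors ($k = 1$ here, so the threshold is $4/n$), then the corresponding observation may be regarded as an influential observation.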

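5. Regression Diagnostics

5.2 Checking for Outliers and Influential Observations

The standardized residuals below match the residuals of the log-transformed fit.

| $i$ | $e_i^*$ | $h_{ii}$ |
|---|---|---|
| 1 | 2.8653 | 0.3778 |
| 2 | -0.4113 | 0.2611 |
| 3 | -0.5367 | 0.1778 |
| 4 | -0.8505 | 0.1278 |
| 5 | -0.4067 | 0.1111 |
| 6 | -0.2102 | 0.1278 |
| 7 | -0.5519 | 0.1778 |
| 8 | 0.1416 | 0.2611 |
| 9 | 0.8484 | 0.3778 |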
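The outlier and influence checks can be sketched as below. The sketch assumes the residuals come from the log-transformed fit (back-transformed to the original scale) and uses the thresholds $|e_i^*| > 2$ and $h_{ii} > 4/n$; under those assumptions it should roughly reproduce the tabulated values.

```python
import math

# Tire tread wear data from Table 3.1
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Fit ln(y) on x, then take residuals on the original scale (assumption)
ln_y = [math.log(yi) for yi in y]
ln_bar = sum(ln_y) / n
a1 = sum((xi - x_bar) * (li - ln_bar) for xi, li in zip(x, ln_y)) / s_xx
a0 = ln_bar - a1 * x_bar
resid = [yi - math.exp(a0 + a1 * xi) for xi, yi in zip(x, y)]

s = math.sqrt(sum(e * e for e in resid) / (n - 2))
leverage = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]                  # h_ii
std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]  # e_i*

outliers = [i + 1 for i, e in enumerate(std_resid) if abs(e) > 2]
influential = [i + 1 for i, h in enumerate(leverage) if h > 4 / n]

for i, (e, h) in enumerate(zip(std_resid, leverage), start=1):
    print(i, round(e, 4), round(h, 4))
print("outliers:", outliers, "influential:", influential)
```

Only the first observation exceeds the outlier threshold, and no leverage exceeds $4/9 \approx 0.444$.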

MATLAB Code for Regression Diagnostics

```matlab
clear; clc;
x = [0 4 8 12 16 20 24 28 32];      % mileage (in 1000 miles)
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33];  % groove depth
n = length(x);

coef = polyfit(x, y, 1)             % linear regression predicts y from x
% coef = polyfit(x, log(y), 1)      % use this line for the log-transformed data
yfit = polyval(coef, x)             % use coef to predict y
yresid = y - yfit                   % compute the residuals
% yresid = y - exp(yfit)            % residuals for the log-transformed fit

ssresid = sum(yresid.^2);           % residual sum of squares
sstotal = (n - 1) * var(y);         % total sum of squares
rsq = 1 - ssresid/sstotal;          % R square

figure; normplot(yresid)            % normal plot for residuals
[h, pval, jbstat, critval] = jbtest(yresid)   % test normality

figure; scatter(x, y, 500, 'r', '.')  % generate the scatter plot
lsline                              % add the least squares line
xlabel('x_i'); ylabel('y_i'); title('plot of...')

% check for outliers: standardized residuals
Sxx = sum((x - mean(x)).^2);        % equals 960 for these data
s = sqrt(ssresid/(n - 2));          % estimate of sigma
lev = 1/n + (x - mean(x)).^2 / Sxx; % leverages h_ii
estar = yresid ./ (s * sqrt(1 - lev))
outliers = find(abs(estar) > 2)

% check for influential observations
influential = find(lev > 4/n)
```

6.1 Correlation Analysis

Why do we need this? Regression analysis is used to model the relationship between two variables when one is treated as the response and the other as the predictor. But when there is no such distinction and both variables are random, correlation analysis is used to study the strength of the relationship.

6.1 Correlation Analysis - Example

Figure 6.1 Example (variables shown: flu reported, life expectancy, economy level, people who get flu shot, temperature, economic growth)

6.2 Bivariate Normal Distribution Figure 6.2

6.2 Why introduce Bivariate Normal Distribution?

6.3 Statistical Inference of r

Define the r.v. R corresponding to r. But the distribution of R is quite complicated.

Figure 6.3 Density $f(r)$ of R (panel labels: -0.7, -0.3, 0.5, 0)

6.3 Exact test when ρ=0

Test: $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$

Test statistic:

$$T_0 = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^2}} \sim t_{n-2} \quad \text{under } H_0$$

Reject $H_0$ iff $|t_0| > t_{n-2,\alpha/2}$.

Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?

$H_0: \rho = 0$, $H_a: \rho \neq 0$

$$t_0 = \frac{0.7 \sqrt{15 - 2}}{\sqrt{1 - 0.7^2}} = 3.534 > t_{13,.005} = 3.012$$

So, we reject $H_0$.
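The arithmetic of the example can be checked with a short sketch. The function name `exact_t_statistic` is an illustrative assumption, and the critical value 3.012 is the tabulated $t_{13,.005}$ quoted in the example.

```python
import math

def exact_t_statistic(r, n):
    """t statistic for H0: rho = 0, given sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t0 = exact_t_statistic(r=0.7, n=15)
T_CRIT = 3.012   # t_{13, .005}, from the example
print(f"t0 = {t0:.3f}, reject H0: {abs(t0) > T_CRIT}")  # t0 = 3.534, reject H0: True
```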

6.3 Note: the t-test of $H_0: \rho = 0$ and the t-test of $H_0: \beta_1 = 0$ are the same!

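6.3 Approximate test when ρ≠0

Because the exact distribution of R is not very useful for making inference on ρ, R. A. Fisher showed that the following transformation of R is approximately normally distributed. That is,

$$\hat{\psi} = \tanh^{-1} R = \frac{1}{2} \ln\!\left(\frac{1 + R}{1 - R}\right) \approx N\!\left(\frac{1}{2} \ln\!\left(\frac{1 + \rho}{1 - \rho}\right), \; \frac{1}{n - 3}\right).$$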

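6.3 Steps to do the approximate test on ρ

1. Hypotheses: $H_0: \rho = \rho_0$ vs. $H_1: \rho \neq \rho_0$, equivalently $H_0: \psi = \psi_0 = \frac{1}{2} \ln\!\left(\frac{1 + \rho_0}{1 - \rho_0}\right)$
2. Point estimator: $\hat{\psi} = \frac{1}{2} \ln\!\left(\frac{1 + r}{1 - r}\right)$
3. T.S.: $z_0 = \sqrt{n - 3}\,(\hat{\psi} - \psi_0)$; reject $H_0$ if $|z_0| > z_{\alpha/2}$
4. C.I.: for $\psi$, $l = \hat{\psi} - \frac{z_{\alpha/2}}{\sqrt{n - 3}} \le \psi \le \hat{\psi} + \frac{z_{\alpha/2}}{\sqrt{n - 3}} = u$; for $\rho$, $\frac{e^{2l} - 1}{e^{2l} + 1} \le \rho \le \frac{e^{2u} - 1}{e^{2u} + 1}$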
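The four steps can be sketched with the standard library's `statistics.NormalDist` for the normal quantile. The inputs reuse r = 0.7 and n = 15 from the earlier example; the null value `rho0 = 0.5` is a hypothetical choice made only for illustration.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n, rho0, alpha=0.05):
    """Approximate test of H0: rho = rho0 and a C.I. for rho via Fisher's transformation."""
    psi_hat = math.atanh(r)      # 0.5 * ln((1 + r) / (1 - r))
    psi_0 = math.atanh(rho0)
    z0 = math.sqrt(n - 3) * (psi_hat - psi_0)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit / math.sqrt(n - 3)
    lower = math.tanh(psi_hat - half_width)   # back-transform the C.I. for psi
    upper = math.tanh(psi_hat + half_width)
    return z0, abs(z0) > z_crit, (lower, upper)

# rho0 = 0.5 is a hypothetical null value
z0, reject, ci = fisher_z_test(r=0.7, n=15, rho0=0.5)
print(f"z0 = {z0:.3f}, reject H0: {reject}")
print(f"95% C.I. for rho: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note that `math.tanh` is the same back-transformation as $(e^{2l} - 1)/(e^{2l} + 1)$.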

6.4 The pitfalls of correlation analysis

- Lurking variable
- Over-extrapolation

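7. Implementation in SAS

Table 7.1 Vote example data

| | state | district | democA | voteA | expendA | expendB | prtystrA | lexpendA | lexpendB | shareA |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | "AL" | 7 | 1 | 68 | 328.3 | 8.74 | 41 | 5.793916 | 2.167567 | 97.41 |
| 2 | "AK" | 1 | 0 | 62 | 626.38 | 402.48 | 60 | 6.439952 | 5.997638 | 60.88 |
| 3 | "AZ" | 2 | 1 | 73 | 99.61 | 3.07 | 55 | 4.601233 | 1.120048 | 97.01 |
| … | | | | | | | | | | |
| 173 | "WI" | 8 | 1 | 30 | 14.42 | 227.82 | 47 | 2.668685 | 5.428569 | 5.95 |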

7. Implementation in SAS

SAS code of the vote example:

```sas
proc corr data=vote1;
  var F4 F10;
run;
```

Table 7.2 Correlation coefficients (Pearson Correlation Coefficients, N = 173; Prob > |r| under H0: Rho=0)

| | F4 | F10 |
|---|---|---|
| F4 | 1.00000 | 0.92528 |

```sas
proc reg data=vote1;
  model F4=F10;
  label F4=voteA;
  label F10=shareA;
  output out=fitvote residual=R;
run;
```

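7. Implementation in SAS

SAS output

Table 7.3 SAS output for vote example

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 41486 | 41486 | 1017.70 | <.0001 |
| Error | 171 | 6970.77364 | 40.76476 | | |
| Corrected Total | 172 | 48457 | | | |

| Root MSE | 6.38473 | R-Square | 0.8561 |
|---|---|---|---|
| Dependent Mean | 50.50289 | Adj R-Sq | 0.8553 |
| Coeff Var | 12.64230 | | |

Parameter Estimates

| Variable | Label | DF | Parameter Estimate | Standard Error | t Value |
|---|---|---|---|---|---|
| Intercept | | 1 | 26.81254 | 0.88719 | 30.22 |
| F10 | | 1 | 0.46382 | 0.01454 | 31.90 |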

7. Implementation in SAS

Figure 7.1 Plot of Residual vs. ShareA for vote example

Figure 7.2 Plot of voteA vs. shareA for vote example

SAS: Check Homoscedasticity

Figure 7.3 Plots of SAS output for vote example

7. Implementation in SAS

SAS: Check Normality of Residuals

SAS code:

```sas
proc univariate data=fitvote normal;
  var R;
  qqplot R / normal (Mu=est Sigma=est);
run;
```

Table 7.4 SAS output for checking normality

Tests for Location: Mu0=0

| Test | Statistic | | p Value | |
|---|---|---|---|---|
| Student's t | t | 0 | Pr > \|t\| | 1.0000 |
| Sign | M | -0.5 | Pr >= \|M\| | 1.0000 |
| Signed Rank | S | -170.5 | Pr >= \|S\| | 0.7969 |

Tests for Normality

| Test | Statistic | | p Value | |
|---|---|---|---|---|
| Shapiro-Wilk | W | 0.952811 | Pr < W | 0.7395 |
| Kolmogorov-Smirnov | D | 0.209773 | Pr > D | >0.1500 |
| Cramer-von Mises | W-Sq | 0.056218 | Pr > W-Sq | >0.2500 |
| Anderson-Darling | A-Sq | 0.30325 | Pr > A-Sq | >0.2500 |

Figure 7.4 Plot of Residual vs. Normal Quantiles for vote example

8. Application

Linear regression is widely used to describe possible relationships between variables. It ranks as one of the most important tools in these disciplines:

- Marketing/business analytics
- Healthcare
- Finance
- Economics
- Ecology/environmental science

8. Application

Prediction, forecasting or deduction

Linear regression can be used to fit a predictive model to an observed data set of Y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of Y, the fitted model can be used to make a prediction of the value of Y.
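As a small illustration of prediction, the sketch below fits the tire data of Table 3.1 and predicts the groove depth at a new mileage. The choice of 25 (thousand miles) and the function names are assumptions made for this example.

```python
# Tire tread wear data from Table 3.1
x = [0, 4, 8, 12, 16, 20, 24, 28, 32]
y = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

def fit_line(x, y):
    """Least squares intercept and slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def predict(b0, b1, x_new):
    """Point prediction of Y at a new x from the fitted line."""
    return b0 + b1 * x_new

b0, b1 = fit_line(x, y)
x_new = 25   # hypothetical new mileage, inside the observed range 0 to 32
print(f"predicted groove depth at x = {x_new}: {predict(b0, b1, x_new):.2f}")
```

The new x is kept inside the observed range on purpose, since predictions far outside it amount to extrapolation.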

8. Application

Quantifying the strength of relationship

Given a variable $Y$ and a number of variables $X_1, \dots, X_p$ that may be related to $Y$, linear regression analysis can be applied to assess which $X_j$ may have no relationship with $Y$ at all, and to identify which subsets of the $X_j$ contain redundant information about $Y$.

8. Application

Example 1. Trend line

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. Trend lines are sometimes used in business analytics to show changes in data over time.

Figure 8.1 Refrigerator sales over a 13-year period (http://www.likeoffice.com/28057/Excel-2007-Formatting-charts)

8. Application

Example 2. Clinical drug trials

Regression analysis is widely utilized in healthcare. The graph shows an example in which we investigate the relationship between protein concentration and absorbance employing linear regression analysis.

Figure 8.2 BSA Protein Concentration Vs. Absorbance (http://openwetware.org/wiki/User:Laura_Flynn/Notebook/Experimental_Biological_Chemistry/2011/09/13)

Summary

- Linear Regression Analysis
  - Probabilistic Models
  - Least Squares Estimate
  - Statistical Inference
  - Model Assumptions: Linearity, Constant Variance & Normality; Data Transformation
  - Outliers & Influential Observations
- Correlation Analysis
  - Correlation Coefficient (Bivariate Normal Distribution, Exact T-test, Approximate Z-test)

Acknowledgement & References

Acknowledgement

Sincere thanks go to Prof. Wei Zhu.

References

- Ajit Tamhane & Dorothy Dunlop, Statistics and Data Analysis.
- Jeffrey M. Wooldridge, Introductory Econometrics: A Modern Approach, 5th ed.
- http://en.wikipedia.org/wiki/Regression_analysis
- http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
- etc. (web links have already been included in the slides)
