

1
Simple Linear Regression & Correlation
AMS 572 Group Project
Instructor: Prof. Wei Zhu
11/21/2013

2
Outline
1. Motivation & Introduction – Lizhou Nie
2. A Probabilistic Model for Simple Linear Regression – Long Wang
3. Fitting the Simple Linear Regression Model – Zexi Han
4. Statistical Inference for Simple Linear Regression – Lichao Su
5. Regression Diagnostics – Jue Huang
6. Correlation Analysis – Ting Sun
7. Implementation in SAS – Qianyi Chen
8. Application and Summary – Jie Shuai

3
1. Motivation
Fig. 1.1 Simplified model of the solar system
Fig. 1.2 Obama & Romney during the presidential election campaign

4
Introduction
Regression Analysis
- Simple Linear Regression: {y, x}
- Multiple Linear Regression: {y; x_1, ..., x_p}
- Multivariate Linear Regression: {y_1, ..., y_n; x_1, ..., x_p}
Correlation Analysis
- Pearson Product-Moment Correlation Coefficient: a measure of the linear relationship between two variables

5
History
Adrien-Marie Legendre: earliest form of regression, the least squares method
Carl Friedrich Gauss: further development of least squares theory, including the Gauss-Markov theorem
Sir Francis Galton: coined the term "regression"
George Udny Yule & Karl Pearson: extension to a more general statistical context

6
2. A Probabilistic Model
Simple Linear Regression
- A special case of linear regression: one response variable and one explanatory variable
General Setting
- We denote the explanatory variable by x_i and the response variable by y_i
- n pairs of observations {(x_i, y_i)}, i = 1, ..., n

7
2. A Probabilistic Model
Sketch of the data: scatter plot of Y vs. X (e.g., the point (29, 5.5))

8
2. A Probabilistic Model
In simple linear regression, the data are described by
Y_i = β_0 + β_1 x_i + ε_i, where ε_i ~ N(0, σ²)
The fitted model:
ŷ = β̂_0 + β̂_1 x
where β_0 is the intercept and β_1 is the slope of the regression line.
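The probabilistic model above can be sketched numerically. The parameter values below (β_0 = 2, β_1 = 0.5, σ = 1) are arbitrary assumptions for illustration, not values from the slides:

```python
import random

# Hypothetical parameters for illustration (not from the slides)
beta0, beta1, sigma, n = 2.0, 0.5, 1.0, 100

random.seed(42)
x = [float(i) for i in range(n)]
# Y_i = beta0 + beta1 * x_i + eps_i, with eps_i ~ N(0, sigma^2)
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
```

Each simulated response is the straight-line mean β_0 + β_1 x_i plus an independent Gaussian error, which is exactly the model assumption above.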

9
3. Fitting the Simple Linear Regression Model
Table 3.1 and Fig. 3.1 Scatter plot of tire tread wear (groove depth, in mils) vs. mileage (in 1000 miles). From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.

10
3. Fitting the Simple Linear Regression Model
The difference between the fitted line and the observed data is the residual e_i = y_i − ŷ_i. Our goal: minimize the sum of squares of the residuals.
Fig 3.2 e_i is the vertical distance between the fitted line and the observed data.

11
3. Fitting the Simple Linear Regression Model
Least Squares Method: choose β̂_0 and β̂_1 to minimize
Q = Σ_{i=1}^{n} [y_i − (β_0 + β_1 x_i)]²

12
3. Fitting the Simple Linear Regression Model
Setting ∂Q/∂β_0 = 0 and ∂Q/∂β_1 = 0 yields the least squares estimates
β̂_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)², β̂_0 = ȳ − β̂_1 x̄

13
3. Fitting the Simple Linear Regression Model
To simplify, we denote:
S_xx = Σ(x_i − x̄)², S_yy = Σ(y_i − ȳ)², S_xy = Σ(x_i − x̄)(y_i − ȳ)
so that β̂_1 = S_xy / S_xx and β̂_0 = ȳ − β̂_1 x̄.
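The closed-form estimates can be computed directly from S_xx and S_xy. The five data points below are made up for illustration (they are not the tire tread data):

```python
# Made-up illustrative data (not the tire tread data)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx          # slope estimate, beta1-hat
b0 = ybar - b1 * xbar   # intercept estimate, beta0-hat
```

For these numbers, S_xx = 10 and S_xy = 19.9, so β̂_1 = 1.99 and β̂_0 = 0.05.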

14
3. Fitting the Simple Linear Regression Model
Back to the example:

15
3. Fitting the Simple Linear Regression Model
Therefore, the equation of the fitted line is ŷ = β̂_0 + β̂_1 x. Not enough! We still need to check how well the line fits the data.

16
3. Fitting the Simple Linear Regression Model
Check the goodness of fit of the LS line. We define:
SST = Σ(y_i − ȳ)² (total sum of squares)
SSR = Σ(ŷ_i − ȳ)² (regression sum of squares)
SSE = Σ(y_i − ŷ_i)² (error sum of squares)
One can prove that SST = SSR + SSE. The ratio r² = SSR/SST is called the coefficient of determination.
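The decomposition SST = SSR + SSE can be verified numerically; the data below are made up for illustration:

```python
# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]   # fitted values

SST = sum((yi - ybar) ** 2 for yi in y)          # total sum of squares
SSR = sum((yh - ybar) ** 2 for yh in yhat)       # regression sum of squares
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error sum of squares
r2 = SSR / SST                                   # coefficient of determination
```

The two pieces SSR and SSE add back up to SST (up to floating-point rounding), and r² falls between 0 and 1.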

17
3. Fitting the Simple Linear Regression Model
Check the goodness of fit of the LS line. Back to the example: r² = 0.953, so r = −0.976, where the sign of r follows from the sign of β̂_1. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.

18
3. Fitting the Simple Linear Regression Model
r is the sample correlation coefficient between X and Y:
r = S_xy / √(S_xx S_yy)
For simple linear regression, r² equals the coefficient of determination, and r = β̂_1 √(S_xx / S_yy).

19
3. Fitting the Simple Linear Regression Model
Estimation of σ². The variance σ² measures the scatter of the Y_i around their means. An unbiased estimate of σ² is given by
s² = SSE / (n − 2)
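A minimal sketch of the variance estimate, again on made-up data:

```python
import math

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

s2 = SSE / (n - 2)   # unbiased estimate of sigma^2, with n-2 d.f.
s = math.sqrt(s2)    # estimate of sigma
```

Note the divisor n − 2, not n: two degrees of freedom are used up estimating β_0 and β_1.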

20
3. Fitting the Simple Linear Regression Model
From the example, we have SSE = … and n − 2 = 7; therefore s² = SSE/7, which has 7 d.f. The estimate of σ is s = √s².

21
4. Statistical Inference for SLR

22
Under the normal error assumption
* Point estimators: β̂_0 and β̂_1 (the least squares estimates)
* Sampling distributions of β̂_0 and β̂_1:
β̂_1 ~ N(β_1, σ²/S_xx), β̂_0 ~ N(β_0, σ²[1/n + x̄²/S_xx])

23
Derivation

24
For the mathematical derivations, please refer to the Tamhane and Dunlop textbook, p. 331.

25
Statistical Inference on β_0 and β_1
* Pivotal Quantities (P.Q.'s):
(β̂_0 − β_0)/SE(β̂_0) ~ t_{n−2}, (β̂_1 − β_1)/SE(β̂_1) ~ t_{n−2}
* Confidence Intervals (C.I.'s):
β̂_0 ± t_{n−2, α/2} SE(β̂_0), β̂_1 ± t_{n−2, α/2} SE(β̂_1)
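A sketch of the t-based confidence interval for β_1. The data are made up, and the critical value t_{3, .025} = 3.182 is taken from a standard t table:

```python
import math

# Made-up illustrative data (n = 5, so n - 2 = 3 d.f.)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(SSE / (n - 2))

se_b1 = s / math.sqrt(Sxx)       # SE(beta1-hat) = s / sqrt(Sxx)
t_crit = 3.182                   # t_{n-2, alpha/2} = t_{3, 0.025}, from a t table
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # 95% C.I. for beta1
```

The interval is the point estimate plus or minus a t critical value times the standard error, exactly the C.I. formula above.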

26
A useful application is to test whether there is a linear relationship between x and y.
Hypothesis tests:
Reject H_0: β_1 = 0 at level α if |β̂_1|/SE(β̂_1) > t_{n−2, α/2}
Reject H_0: β_0 = 0 at level α if |β̂_0|/SE(β̂_0) > t_{n−2, α/2}

27
Analysis of Variance (ANOVA)
Mean Square: a sum of squares divided by its degrees of freedom.

28
ANOVA Table

Source of Variation | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS)    | F
Regression          | SSR                 | 1                         | MSR = SSR/1         | F = MSR/MSE
Error               | SSE                 | n − 2                     | MSE = SSE/(n − 2)   |
Total               | SST                 | n − 1                     |                     |
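The ANOVA table can be checked numerically. In simple linear regression the F statistic equals the square of the t statistic for β_1, which makes a handy sanity check (illustrative data):

```python
import math

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SSR = sum((yh - ybar) ** 2 for yh in yhat)
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
MSR = SSR / 1                     # regression mean square, 1 d.f.
MSE = SSE / (n - 2)               # error mean square, n-2 d.f.
F = MSR / MSE                     # ANOVA F statistic

t = b1 / (math.sqrt(MSE) / math.sqrt(Sxx))   # t statistic for H0: beta1 = 0
```

Both tests reject H_0: β_1 = 0 under exactly the same conditions, since F = t².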

29
5. Regression Diagnostics
5.1 Checking the Model Assumptions
- Checking for Linearity
- Checking for Constant Variance
- Checking for Normality
- Primary tool: residual plots
5.2 Checking for Outliers and Influential Observations
- Checking for Outliers
- Checking for Influential Observations
- How to Deal with Outliers and Influential Observations


31
5. Regression Diagnostics
5.1.1 Checking for Linearity

32
5. Regression Diagnostics
5.1.1 Checking for Linearity (Data Transformation)
Depending on the shape of the scatter plot, transform x (e.g., x², x³, log x, −1/x) and/or y (e.g., log y, 1/y, y², y³) to linearize the relationship.
Figure 5.2 Typical scatter plot shapes and corresponding linearizing transformations
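As an illustration of a linearizing transformation: data that follow an exact exponential relationship y = 2·e^{0.5x} (constants made up) become a straight line after taking log y, and least squares on the transformed data recovers the constants:

```python
import math

# Made-up exponential relationship: y = 2 * exp(0.5 * x)
x = [0, 1, 2, 3, 4, 5]
y = [2 * math.exp(0.5 * xi) for xi in x]

# Transform y -> log y, then fit a straight line by least squares
ly = [math.log(yi) for yi in y]
n = len(x)
xbar, lybar = sum(x) / n, sum(ly) / n
b1 = sum((xi - xbar) * (li - lybar) for xi, li in zip(x, ly)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = lybar - b1 * xbar   # slope recovers 0.5, intercept recovers log(2)
```

The same least squares machinery applies; only the scale of the response has changed.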

33
5. Regression Diagnostics
5.1.1 Checking for Linearity (Data Transformation)


35
5. Regression Diagnostics
5.1.2 Checking for Constant Variance


37
5. Regression Diagnostics
5.1.3 Checking for Normality
Make a normal plot of the residuals. The residuals have mean zero and an approximately constant variance (assuming the other assumptions about the model are correct).


39
5. Regression Diagnostics
Outlier: an observation that does not follow the general pattern of the relationship between y and x. A large residual indicates an outlier. Standardized residuals are given by
e_i* = e_i / (s √(1 − h_ii))
If |e_i*| > 2, the corresponding observation may be regarded as an outlier.
Influential observation: an influential observation has an extreme x-value, an extreme y-value, or both. If we express the fitted value ŷ_i as a linear combination of all the y_j, the weight h_ii (the leverage) of y_i is
h_ii = 1/n + (x_i − x̄)²/S_xx
If h_ii > 4/n, the corresponding observation may be regarded as influential.
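These diagnostics can be sketched directly. The data are made up, with one deliberately extreme x-value; the identity Σ h_ii = 2 (which holds for simple linear regression) serves as a sanity check:

```python
import math

# Made-up data; the last point has an extreme x-value
x = [1, 2, 3, 4, 5, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 41.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e * e for e in resid) / (n - 2))

lev = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]         # leverages h_ii
std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, lev)]

outliers = [i for i, e in enumerate(std_resid) if abs(e) > 2]
influential = [i for i, h in enumerate(lev) if h > 4 / n]  # rule of thumb
```

The point at x = 20 has leverage close to 1, far above the 4/n threshold, so it is flagged as influential.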

40
5. Regression Diagnostics
5.2 Checking for Outliers and Influential Observations

41
MATLAB Code for Regression Diagnostics
clear; clc;
x = [ ];                          % explanatory variable (data elided)
y = [ ];                          % response variable (data elided)
y1 = log(y);                      % data transformation
p = polyfit(x,y,1)                % linear regression: predict y from x
% p = polyfit(x,log(y),1)         % alternative: fit the transformed data
yfit = polyval(p,x);              % use p to predict y
yresid = y - yfit;                % compute the residuals
% yresid = y1 - exp(yfit)         % residuals for the transformed data
ssresid = sum(yresid.^2);         % residual (error) sum of squares, SSE
sstotal = (length(y)-1) * var(y); % total sum of squares, SST
rsq = 1 - ssresid/sstotal;        % R-square
normplot(yresid)                  % normal plot of the residuals
[h,pval,jbstat,critval] = jbtest(yresid)  % Jarque-Bera test of normality
scatter(x,y,500,'r','.')          % generate the scatter plot
lsline                            % add the least squares line
axis([-5,35,-10,25])
xlabel('x_i'); ylabel('y_i'); title('plot of ...')
n = length(x);
Sxx = sum((x - mean(x)).^2);      % Sxx; equals 960 for the slide's data
s = sqrt(ssresid/(n-2));          % estimate of sigma
for i = 1:n                       % check for outliers
    hii = 1/n + (x(i)-mean(x))^2/Sxx;        % leverage of observation i
    estd(i) = yresid(i)/(s*sqrt(1-hii));     % standardized residual
end
for j = 1:n                       % check for influential observations
    q(j) = 1/n + (x(j)-mean(x))^2/Sxx;       % leverage h_jj
end

42
6.1 Correlation Analysis
Why do we need this? Regression analysis is used to model the relationship between two variables. But when there is no such distinction and both variables are random, correlation analysis is used to study the strength of the relationship.

43
6.1 Correlation Analysis – Example
Figure 6.1 Examples: flu reported, life expectancy, economy level, people who get a flu shot, temperature, economic growth

44
6.2 Bivariate Normal Distribution Figure 6.2

45
6.2 Why introduce the Bivariate Normal Distribution?

46
6.3 Statistical Inference on ρ
Define the r.v. R corresponding to r. But the distribution of R is quite complicated.
Figure 6.3 The density f(r) of R for several values of ρ

47
6.3 Exact test when ρ = 0
Test: H_0: ρ = 0 vs. H_a: ρ ≠ 0
Test statistic: t_0 = r√(n − 2) / √(1 − r²), which follows t_{n−2} under H_0
Reject H_0 iff |t_0| > t_{n−2, α/2}
Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?
H_0: ρ = 0, H_a: ρ ≠ 0
t_0 = 0.7√13 / √(1 − 0.49) ≈ 3.53 > t_{13, .005} = 3.012
So we reject H_0.
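The slide's worked example (n = 15, r = 0.7, α = .01) can be reproduced directly; the critical value t_{13, .005} = 3.012 comes from a standard t table:

```python
import math

n, r, alpha = 15, 0.7, 0.01
# Exact test statistic, distributed t_{n-2} under H0: rho = 0
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_crit = 3.012                    # t_{13, 0.005}, from a t table

reject = abs(t0) > t_crit         # two-sided test at level alpha = .01
```

Since t_0 ≈ 3.53 exceeds 3.012, the correlation is significant at the .01 level, matching the slide's conclusion.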

48
6.3 Note: the t statistic for testing H_0: β_1 = 0 and the t statistic for testing H_0: ρ = 0 are the same!

49
6.3 Approximate test when ρ ≠ 0
Because the exact distribution of R is not very useful for making inferences on ρ, R. A. Fisher showed that we can apply the transformation
ẑ = (1/2) ln[(1 + r)/(1 − r)]
which is approximately normally distributed. That is,
ẑ ≈ N( (1/2) ln[(1 + ρ)/(1 − ρ)], 1/(n − 3) )
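Fisher's transformation is just the inverse hyperbolic tangent, so an approximate confidence interval for ρ can be sketched with the standard library. The sample values (n = 15, r = 0.7) and the critical value z_{.025} = 1.96 are assumptions for illustration:

```python
import math

n, r = 15, 0.7
zhat = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's z; identical to atanh(r)
se = 1 / math.sqrt(n - 3)                  # approximate standard error
z_crit = 1.96                              # z_{alpha/2} for a 95% C.I.

# Back-transform the interval for zeta to the rho scale via tanh
lo = math.tanh(zhat - z_crit * se)
hi = math.tanh(zhat + z_crit * se)
```

Because tanh is monotone, transforming the endpoints of the interval for ζ gives a valid interval for ρ.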

50
6.3 Steps for the approximate test on ρ
1. H_0: ρ = ρ_0 vs. H_1: ρ ≠ ρ_0
2. Point estimator: ẑ = (1/2) ln[(1 + r)/(1 − r)]
3. Test statistic: z_0 = (ẑ − ζ_0)√(n − 3), where ζ_0 = (1/2) ln[(1 + ρ_0)/(1 − ρ_0)]
4. C.I. for ρ: back-transform ẑ ± z_{α/2}/√(n − 3) to the ρ scale

51
6.4 The pitfalls of correlation analysis
- Lurking variables
- Overextrapolation

52
7. Implementation in SAS
Table 7.1 Vote example data: 173 observations (1 "AL", 2 "AK", 3 "AZ", …, 173 "WI") on the variables state, district, democA, voteA, expendA, expendB, prtystrA, lexpendA, lexpendB, shareA

53
7. Implementation in SAS
SAS code for the vote example:

proc corr data=vote1;
var F4 F10;
run;

proc reg data=vote1;
model F4=F10;
label F4=voteA;
label F10=shareA;
output out=fitvote residual=R;
run;

Table 7.2 Correlation coefficients: Pearson Correlation Coefficients, N = 173, Prob > |r| under H0: Rho=0, for F4 and F10

54
7. Implementation in SAS
SAS output (Table 7.3) for the vote example:
- Analysis of Variance: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F for Model, Error, and Corrected Total (Pr > F < .0001)
- Fit statistics: Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var
- Parameter Estimates: Variable, Label, DF, Parameter Estimate, Standard Error, t Value for Intercept and F10

55
Figure7.1 Plot of Residual vs. ShareA for vote example 7. Implementation in SAS

56
Figure7.2 Plot of voteA vs. shareA for vote example 7. Implementation in SAS

57
SAS-Check Homoscedasticity Figure7.3 Plots of SAS output for vote example 7. Implementation in SAS

58
7. Implementation in SAS
SAS-Check Normality of Residuals
SAS code:

proc univariate data=fitvote normal;
var R;
qqplot R / normal (Mu=est Sigma=est);
run;

Table 7.4 SAS output for checking normality:
- Tests for Location (Mu0=0): Student's t (Pr > |t|), Sign M = -0.5 (Pr >= |M|), Signed Rank S = -170.5 (Pr >= |S|)
- Tests for Normality: Shapiro-Wilk W (Pr < W), Kolmogorov-Smirnov D (Pr > D), Cramer-von Mises W-Sq (Pr > W-Sq), Anderson-Darling A-Sq (Pr > A-Sq)

59
SAS-Check Normality of Residuals Figure7.4 Plot of Residual vs. Normal Quantiles for vote example 7. Implementation in SAS

60
8. Application
Linear regression is widely used to describe possible relationships between variables, and it ranks as one of the most important tools in disciplines such as:
- Marketing/business analytics
- Healthcare
- Finance
- Economics
- Ecology/environmental science

61
8. Application
Prediction, forecasting, or deduction: linear regression can be used to fit a predictive model to an observed data set of Y and X values. After developing such a model, if an additional value of X is given without its accompanying value of Y, the fitted model can be used to predict the value of Y.
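The prediction step can be sketched directly: fit the line on the observed pairs, then plug a new x into the fitted equation. All numbers below are made up for illustration:

```python
# Made-up observed data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

x_new = 6                        # a new X observed without its Y
y_pred = b0 + b1 * x_new         # predicted Y from the fitted model
```

Note this is the point prediction only; a full analysis would attach a prediction interval using s and the leverage of x_new.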

62
8. Application
Quantifying the strength of a relationship: given a variable Y and a number of variables X_1, ..., X_p that may be related to Y, linear regression analysis can be applied to assess which X_j may have no relationship with Y at all, and to identify which subsets of the X_j contain redundant information about Y.

63
8. Application
Example 1. Trend line: a trend line represents a trend, the long-term movement in time series data after other components have been accounted for. Trend lines are sometimes used in business analytics to show changes in data over time.
Figure 8.1 Refrigerator sales over a 13-year period

64
8. Application
Example 2. Clinical drug trials: regression analysis is widely utilized in healthcare. The graph shows an example in which we investigate the relationship between protein concentration and absorbance using linear regression analysis.
Figure 8.2 BSA Protein Concentration Vs. Absorbance (Experimental_Biological_Chemistry/2011/09/13)

65
Summary
Linear Regression Analysis
- Probabilistic Models
- Least Squares Estimate
- Statistical Inference
- Regression Diagnostics: Model Assumptions (Linearity, Constant Variance & Normality; Data Transformation); Outliers & Influential Observations
Correlation Analysis
- Correlation Coefficient (Bivariate Normal Distribution, Exact t-test, Approximate z-test)

66
Acknowledgement & References
Sincere thanks go to Prof. Wei Zhu.
References:
- Tamhane, A. & Dunlop, D., Statistics and Data Analysis, Prentice Hall.
- Wooldridge, J. M., Introductory Econometrics: A Modern Approach, 5th ed.
- etc. (web links have already been included in the slides)
