2 SIMPLE LINEAR REGRESSION AND CORRELATION Prepared by: Jackie Zerrle David Fried Chun-Hui Chung Weilai Zhou Shiyhan Zhang Alex Fields Yu-Hsun Cheng Roosevelt Moreno AMS 572.1 DATA ANALYSIS, FALL 2007.

3 What is Regression Analysis?  A statistical methodology for estimating the relationship of a response variable to a set of predictor variables.  It is a tool for the investigation of relationships between variables.  It is often used in economics, for example with supply and demand: how does one aspect of the economy affect other parts?  The underlying method was first proposed by the German mathematician Gauss.

4 Linear Regression  The simplest relationship between x (the predictor variable) and Y (the response variable) is linear: Y = β₀ + β₁x + ε.  Here ε is a random error with E(ε) = 0 and Var(ε) = σ², so μ_Y = E(Y | x) = β₀ + β₁x represents the true but unknown mean of Y.  This relationship is the true regression line.

5 Simple Linear Regression Model

6  4 Basic Assumptions: 1. The mean of Yᵢ is a linear function of xᵢ. 2. The Yᵢ have a common variance σ², which is the same for all values of x. 3. The errors εᵢ are normally distributed. 4. The errors εᵢ are independent.

7 Example -- Sales vs. Advertising --  We are given the cost of advertising and the sales that occurred as a result.  First make a scatter plot.  To get a good fit, however, we will use the least squares (LS) method.

8 Example -- Sales vs. Advertising -- Data
Sales ($000,000s) (y)    Advertising ($000s) (x)
28                       71
14                       31
19                       50
21                       60
16                       35

9 Example -- Sales vs. Advertising --  Try to fit a trial straight line y = β₀ + β₁x, for example with β₀ = 2.5 and some chosen slope β₁.  Look at the deviations between the observed values yᵢ and the corresponding points on the line, yᵢ - (β₀ + β₁xᵢ):

10 Example -- Sales vs. Advertising -- Scatter plot with a trial straight-line fit (figure source: http://learning.mazoo.net/archives/000899.html)

11 Least Squares (Cont…)  The deviations should be as small as possible.  Sum of the squared deviations: Q = Σ [yᵢ - (β₀ + β₁xᵢ)]².  In our example the minimized value is Q = 7.87.  The least squares (LS) estimates are the values of β₀ and β₁ that minimize Q; they are denoted by β̂₀ and β̂₁.

12 Least Squares Estimates  To find β̂₀ and β̂₁, take the first partial derivatives of Q: ∂Q/∂β₀ = -2 Σ [yᵢ - (β₀ + β₁xᵢ)] and ∂Q/∂β₁ = -2 Σ xᵢ[yᵢ - (β₀ + β₁xᵢ)].

13 Normal Equations  We then set these partial derivatives equal to zero and simplify.  These are our normal equations: Σyᵢ = nβ₀ + β₁Σxᵢ and Σxᵢyᵢ = β₀Σxᵢ + β₁Σxᵢ².

14  Solving the normal equations for β̂₀ and β̂₁: β̂₁ = [Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² - (Σxᵢ)²/n] and β̂₀ = ȳ - β̂₁x̄.

15  These formulas can be simplified to β̂₁ = Sxy/Sxx and β̂₀ = ȳ - β̂₁x̄, where Sxy = Σ(xᵢ - x̄)(yᵢ - ȳ) and Sxx = Σ(xᵢ - x̄)².

16  Sxy = Σ(xᵢ - x̄)(yᵢ - ȳ) gives the sum of cross-products of the x's and y's around their respective means.  Sxx = Σ(xᵢ - x̄)² and Syy = Σ(yᵢ - ȳ)² give the sums of squares of the differences between the xᵢ and x̄, and the yᵢ and ȳ, respectively.  These expressions can be simplified to the computational forms Sxy = Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n, Sxx = Σxᵢ² - (Σxᵢ)²/n, and Syy = Σyᵢ² - (Σyᵢ)²/n.

17  The least squares (LS) line, which is an estimate of the true regression line, is ŷ = β̂₀ + β̂₁x.

18 Example -- Sales vs. Advertising --  Find the equation of the LS line relating sales to advertising.  With n = 5, the data give (approximately) Σxᵢ = 247, Σyᵢ = 98, Σxᵢyᵢ = 5192, and Σxᵢ² = 13327, which allows us to get x̄ = 49.4, ȳ = 19.6, Sxx ≈ 1125.2, Sxy ≈ 350.8, and Syy ≈ 117.2.

19 Example -- Sales vs. Advertising --  The slope and intercept estimates are β̂₁ = Sxy/Sxx ≈ 350.8/1125.2 ≈ 0.312 and β̂₀ = ȳ - β̂₁x̄ ≈ 19.6 - 0.312(49.4) ≈ 4.20.  The equation of the LS line is ŷ ≈ 4.20 + 0.312x.
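For reference, a minimal SAS sketch that reproduces this fit (the data are the five pairs as read above; the dataset and variable names sales_adv, sales, and adv are illustrative choices, not from the original slides):

data sales_adv;
  input sales adv;   /* sales in $ millions, advertising in $ thousands */
  datalines;
28 71
14 31
19 50
21 60
16 35
;
run;

proc reg data=sales_adv;
  model sales = adv;   /* prints the LS estimates, the ANOVA table, and R-square */
run;
quit;

The estimates printed by PROC REG should agree, up to rounding, with the hand calculation above.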

20 Coefficient of Determination and Coefficient of Correlation  The residuals are used to evaluate the goodness of fit of the LS line: eᵢ = yᵢ - ŷᵢ = yᵢ - (β̂₀ + β̂₁xᵢ), i = 1, …, n.

21  Error sum of squares: SSE = Σeᵢ² = Q_min, which also equals Syy - Sxy²/Sxx.  Here Syy = Σ(yᵢ - ȳ)² measures the total variation in the yᵢ; this is the total sum of squares (SST).

22  Total sum of squares: SST = Syy = SSR + SSE.  Regression sum of squares: SSR = SST - SSE = Sxy²/Sxx.  The coefficient of determination, r² = SSR/SST, is the ratio of the variation accounted for by the regression to the total variation; the correlation coefficient is r = ±√r², taking the sign of β̂₁ (equivalently, r = Sxy/√(Sxx·Syy)).

23 Sales vs. Advertising -- Coefficient of Determination and Correlation  Calculate r² and r using our data: SST = Syy ≈ 117.2 and SSE ≈ 7.87.  Next calculate SSR = SST - SSE ≈ 109.3.  Then r² = SSR/SST ≈ 0.93 and r ≈ +0.966.  Since about 93% of the variation in sales is accounted for by the linear regression on advertising, the relationship between the two is strongly linear with a positive slope.

24 Estimation of σ²  The variance σ² measures the scatter of the Yᵢ around their means μᵢ = β₀ + β₁xᵢ.  The unbiased estimate of the variance is given by s² = SSE/(n - 2) = Σeᵢ²/(n - 2).

25 Sales vs. Advertising -- Estimation of σ²  Find the estimate of σ² using our past results: SSE = 7.87 and n - 2 = 3, so s² = 7.87/3 ≈ 2.62.  The estimate of σ is s ≈ 1.62.

26 Statistical Inference on β₀ and β₁  ⅰ. Point estimator  ⅱ. Confidence interval  ⅲ. Hypothesis test

27 Distributions of β̂₀ and β̂₁  Point estimators: β̂₁ = Sxy/Sxx ~ N(β₁, σ²/Sxx) and β̂₀ = ȳ - β̂₁x̄ ~ N(β₀, σ²[1/n + x̄²/Sxx]).  100(1 - α)% CIs: β̂₁ ± t_{n-2, α/2} · s/√Sxx and β̂₀ ± t_{n-2, α/2} · s · √(1/n + x̄²/Sxx).

28 Hypothesis test  P.Q.: (β̂₁ - β₁)/(s/√Sxx) ~ t_{n-2}.  Hypotheses: H₀: β₁ = β₁⁰ vs. H₁: β₁ ≠ β₁⁰ (most often β₁⁰ = 0, i.e., no linear relationship).

29  Test statistic: t₀ = (β̂₁ - β₁⁰)/(s/√Sxx).  Rejection region: reject H₀ at level α if |t₀| > t_{n-2, α/2}.
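As a worked check of these formulas on the sales example (using the approximate quantities reconstructed above):

t₀ = β̂₁ / (s/√Sxx) ≈ 0.312 / (1.62/√1125.2) ≈ 6.5

Since |t₀| ≈ 6.5 > t_{3, .025} = 3.182, H₀: β₁ = 0 is rejected at α = .05: the advertising effect is statistically significant.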

30 Analysis of Variance for Simple Linear Regression  The analysis of variance (ANOVA) is a statistical technique that decomposes the total variability in the yᵢ's into separate variance components associated with specific sources: SST = SSR + SSE.  A mean square is a sum of squares divided by its degrees of freedom (d.f.).

31  Mean square regression: MSR = SSR/1.  Mean square error: MSE = SSE/(n - 2) = s².

32  The ratio F = MSR/MSE provides a test of the significance of the linear relationship between x and y that is equivalent to the t test: under H₀: β₁ = 0, F ~ F_{1, n-2} and F = t₀²; reject H₀ at level α if F > f_{1, n-2, α}.

33 ANOVA table
Source of Variation (Source)   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)      F
Regression                     SSR                   1                           MSR = SSR/1           MSR/MSE
Error                          SSE                   n - 2                       MSE = SSE/(n - 2)
Total                          SST                   n - 1
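Filled in for the sales example (approximate values, based on the quantities reconstructed above):

Source       SS       d.f.   MS       F
Regression   109.3    1      109.3    41.7
Error        7.87     3      2.62
Total        117.2    4

Note that F = MSR/MSE ≈ 41.7 ≈ t₀² (6.5²), illustrating that the F test and the two-sided t test of H₀: β₁ = 0 are equivalent; since F > f_{1,3,.05} = 10.1, H₀ is rejected.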

34 Prediction of Future Observations  Suppose we fix x at a specified value x*.  How do we predict the value of the r.v. Y*?  Point estimator: Ŷ* = β̂₀ + β̂₁x*.

35 Prediction Intervals (PI)  The interval for the future observation Y* is called a prediction interval; the corresponding interval for its mean E(Y*) is a confidence interval.  Formulas at the 100(1 - α)% level: for Y*, β̂₀ + β̂₁x* ± t_{n-2, α/2} · s · √(1 + 1/n + (x* - x̄)²/Sxx); for E(Y*), β̂₀ + β̂₁x* ± t_{n-2, α/2} · s · √(1/n + (x* - x̄)²/Sxx).

36 Cautions about making predictions  Note that the intervals are shortest when x* equals the sample mean x̄; the farther x* is from x̄, the longer they become.  Extrapolation beyond the range of the data is highly imprecise and should be avoided.

37 Example 10.8  Calculate a 95% PI for the mean groove depth of the population of all tires and for the groove depth of a single tire at a mileage of 25,000 miles (x* = 25), based on the data from earlier sections.  In previous examples we already computed the following quantities (approximately): n = 9, x̄ = 16, Sxx = 960, β̂₀ = 360.64, β̂₁ ≈ -7.281, s ≈ 19.0.

38 Example 10.8 (continued)  Now we simply plug these numbers into our formulas.  The predicted value is ŷ* = 360.64 - 7.281(25) ≈ 178.6 mils.  95% CI for E(Y*): 178.6 ± 2.365(19.0)√(1/9 + 81/960), approximately (158.7, 198.5).  95% PI for Y*: 178.6 ± 2.365(19.0)√(1 + 1/9 + 81/960), approximately (129.4, 227.8).
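In SAS, the same intervals can be obtained by appending a row with the desired x* and a missing response and then requesting confidence and prediction limits (a sketch; the nine (x, y) pairs are those listed in Example 10.10 below, and the dataset and variable names are illustrative):

data tire;
  input mileage groove @@;   /* mileage in 1000s of miles, groove depth in mils */
  datalines;
0 394.33  4 329.50  8 291.00  12 255.17  16 229.33
20 204.83  24 179.00  28 163.83  32 150.33  25 .
;
run;

proc reg data=tire;
  model groove = mileage / clm cli;   /* 95% limits for the mean response (CLM) and for a new observation (CLI) */
run;
quit;

The row "25 ." has a missing groove depth, so it is excluded from the fit but still receives a predicted value and the two intervals.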

39 Calibration (Inverse Regression)  Suppose we are given μ* = E(Y*) and we want an estimate of x*.  We simply solve the linear regression formula for x* to obtain our point estimator x̂* = (μ* - β̂₀)/β̂₁.  Calculating the CI is more complicated and is not covered in this course.

40 Example 10.9  Estimate the mean life of a tire at wearout (62.5 mils remaining).  We want to estimate x* when μ* = 62.5.  From previous examples we have β̂₀ = 360.64 and β̂₁ ≈ -7.281.  Plugging these into our equation gives x̂* = (62.5 - 360.64)/(-7.281) ≈ 40.9, i.e., a mean life of roughly 41,000 miles.

41 REGRESSION DIAGNOSTICS  The four basic assumptions of linear regression need to be verified from the data to ensure that the analysis is valid. 1. The mean of Yᵢ is a linear function of xᵢ. 2. The Yᵢ have a common variance σ², which is the same for all values of x. 3. The errors εᵢ are normally distributed. 4. The errors εᵢ are independent.

42 Checking The Model Assumptions  If the model is correct, then the residuals eᵢ can be viewed as the "estimates" of the random errors εᵢ.  Residual plots are the primary tool.  Checking for linearity  Checking for constant variance  Checking for normality  Checking for independence  How do we do this?

43 Checking for Linearity  If the regression of y on x is linear, then the plot of eᵢ vs. xᵢ should exhibit random scatter around zero.

44 Example 10.10
i    xᵢ (1000 miles)   yᵢ (mils)   ŷᵢ       eᵢ
1    0                 394.33      360.64   33.69
2    4                 329.50      331.51   -2.01
3    8                 291.00      302.39   -11.39
4    12                255.17      273.27   -18.10
5    16                229.33      244.15   -14.82
6    20                204.83      215.02   -10.19
7    24                179.00      185.90   -6.90
8    28                163.83      156.78   7.05
9    32                150.33      127.66   22.67
The residual plot is clearly parabolic.  The linear regression does not fit the data adequately.  Maybe we can try a second-degree model, y = β₀ + β₁x + β₂x² + ε (see the sketch below).
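A minimal SAS sketch of fitting that second-degree model (the squared term is created in the data step because PROC REG does not construct polynomial terms itself; dataset and variable names are illustrative):

data tire_quad;
  input mileage groove @@;
  mileage2 = mileage*mileage;   /* second-degree term */
  datalines;
0 394.33  4 329.50  8 291.00  12 255.17  16 229.33
20 204.83  24 179.00  28 163.83  32 150.33
;
run;

proc reg data=tire_quad;
  model groove = mileage mileage2;
run;
quit;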

45 Checking for Constant Variance  Plot eᵢ vs. ŷᵢ.  Since the ŷᵢ are linear functions of the xᵢ, we can also plot eᵢ vs. xᵢ.  If the constant variance assumption is correct, the plot of eᵢ vs. ŷᵢ should show random scatter with roughly constant spread; a funnel-shaped pattern indicates that the variance changes with the mean.

46 Checking for Normality  Make a normal plot of the residuals. 1. A normal plot requires that the observations form a random sample with a common mean and variance. 2. The yᵢ do not form such a random sample, since their means depend on xᵢ and hence are not equal. 3. The residuals eᵢ are therefore used to make the normal plot (they have a zero mean and an approximately constant variance).

47 Example 10.10 -- Checking for Normality (normal plot of the residuals)

48 Checking for Independence  A well-known statistical test is the Durbin-Watson test, based on d = Σ(eᵢ - eᵢ₋₁)² / Σeᵢ² (the numerator sum running from i = 2 to n). 1. When d is close to 2, the residuals behave as independent (uncorrelated). 2. When d is close to 0, the residuals are positively correlated. 3. When d is close to 4, the residuals are negatively correlated.
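In SAS, the Durbin-Watson statistic can be requested directly from PROC REG (a sketch; mydata, y, and x are placeholders, and the observations must be in time order for d to be meaningful):

proc reg data=mydata;
  model y = x / dw;   /* prints the Durbin-Watson d statistic and the first-order autocorrelation of the residuals */
run;
quit;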

49 CHECKING FOR OUTLIERS AND INFLUENTIAL OBSERVATIONS

50 Checking for Outliers  Standardized residuals: eᵢ* = eᵢ / (s√(1 - hᵢᵢ)), where hᵢᵢ is the leverage of the i-th observation.  Observations with |eᵢ*| larger than about 2 are flagged as possible outliers.

51 Checking for Influential Observations  An influential observation is not necessarily an outlier.  An observation can be influential because it has an extreme x -value, an extreme y -value, or both.  How can we identify influential observations?

52 Leverage  The leverage of the i-th observation is hᵢᵢ = 1/n + (xᵢ - x̄)²/Sxx; it measures how far xᵢ lies from the center of the x values.  The leverages sum to 2 (the number of estimated parameters), so their average is 2/n; an observation whose leverage is much larger than this average (a common rule of thumb is hᵢᵢ > 4/n) is a high-leverage point and potentially influential.

53 How to Deal with Outliers and Influential Observations?  Detect outliers and influential observations.  If any are found, check whether they are erroneous observations: if they are erroneous, discard them; if not, include them in the analysis.  Then do the analysis.  Two separate analyses may be done, one with and one without the outliers and influential observations.

54 Example 10.12
No.    1       2       3       4       5       6       7       8      9       10      11
xᵢ     8       8       8       8       8       8       8       19     8       8       8
yᵢ     6.28    5.76    7.71    8.84    8.47    7.04    5.25    12.50  5.56    7.91    6.89
eᵢ*    -0.341  -1.067  0.582   1.735   1.300   0.031   -1.624  0      -1.271  0.757   -0.089
hᵢᵢ    0.1     0.1     0.1     0.1     0.1     0.1     0.1     1      0.1     0.1     0.1
Observation 8 has leverage hᵢᵢ = 1, so the fitted line passes through it exactly (eᵢ* = 0): it is highly influential even though it is not flagged as an outlier.
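These columns can be reproduced in SAS by writing the standardized (internally studentized) residuals and the leverages to an output dataset (a sketch using the eleven observations above; dataset and variable names are illustrative):

data ex1012;
  input x y @@;
  datalines;
8 6.28   8 5.76   8 7.71   8 8.84   8 8.47   8 7.04
8 5.25   19 12.50   8 5.56   8 7.91   8 6.89
;
run;

proc reg data=ex1012;
  model y = x;
  output out=diag student=estar h=lev;   /* estar = standardized residual, lev = leverage h_ii */
run;
quit;

proc print data=diag;
  var x y estar lev;
run;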

55 DATA TRANSFORMATIONS  Linearizing Transformations -- simple functional relationships.  E.g., the power form y = αx^β: taking logs produces log y = log α + β log x, which is linear in log x.

56 DATA TRANSFORMATIONS  Linearizing Transformations (cont.)  E.g., the exponential form y = αe^(βx): taking natural logs produces ln y = ln α + βx, which is linear in x.

57 DATA TRANSFORMATIONS  Linearizing Transformations

58 DATA TRANSFORMATIONS  Ex. 10.13 (Tire tread wear vs. mileage: exponential model)  The tire data are curved rather than linear, which suggests the exponential model y = αe^(βx).

59 DATA TRANSFORMATIONS  Ex. 10.13 (cont.)  Taking natural logs gives ln y = ln α + βx, so ln y can be fitted to x by ordinary least squares.

60 DATA TRANSFORMATIONS  Ex. 10.13 (cont.)  Back-transforming the fitted line gives the exponential fit ŷ = exp(β̂₀ + β̂₁x), which follows the curvature of the data better than the straight-line fit (see the sketch below).
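A sketch of the linearizing step in SAS: take natural logs of the groove depths and fit a straight line to log y versus mileage (LOG in SAS is the natural logarithm; the data are the pairs from Example 10.10, and the dataset and variable names are assumed):

data tire_exp;
  input mileage groove @@;
  loggroove = log(groove);   /* natural log linearizes y = alpha * exp(beta * x) */
  datalines;
0 394.33  4 329.50  8 291.00  12 255.17  16 229.33
20 204.83  24 179.00  28 163.83  32 150.33
;
run;

proc reg data=tire_exp;
  model loggroove = mileage;   /* intercept estimates ln(alpha), slope estimates beta */
run;
quit;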

61 Variance Stabilizing Transformations  Based on two-term Taylor-series approximations.  Given a relationship between the mean and the variance, Var(Y) = g(μ), the following transformation makes the variances approximately equal, even if the means differ: choose h so that h′(μ) is proportional to 1/√g(μ), i.e., h(y) = ∫ dμ/√g(μ).

62 DATA TRANSFORMATIONS  Variance Stabilizing Transformations -- Delta Method  Let Y have mean μ and variance g(μ), and let h be a smooth function; then h(Y) ≈ h(μ) + h′(μ)(Y - μ), and consequently Var[h(Y)] ≈ [h′(μ)]² g(μ).  Choosing h so that [h′(μ)]² g(μ) is constant stabilizes the variance.

63 DATA TRANSFORMATIONS  Variance Stabilizing Transformations -- Examples  Example 1: if the standard deviation is proportional to the mean, g(μ) = c²μ², then h(y) = ln y gives an approximately constant variance.  Example 2: if the variance is proportional to the mean, g(μ) = cμ (as for Poisson counts), then h(y) = √y stabilizes the variance.

64 CORRELATION ANALYSIS Background on correlation  A number of different correlation coefficients are used for different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. Despite its name, it was first introduced by Francis Galton.

65 CORRELATION ANALYSIS  What do we do when it is not clear which is the predictor variable and which is the response variable, or when both variables are random?  Correlation analysis addresses these situations.

66 Bivariate Normal Distribution  Correlation: a measure of how closely two variables share a linear relationship.  If ρ = 0 the variables are uncorrelated; independence implies ρ = 0, but ρ = 0 does not by itself guarantee independence (it does for the bivariate normal distribution).  If ρ = -1 or +1, there is a perfect linear association.  Correlation is useful when it is not possible to determine which variable is the predictor and which is the response.  Health vs. wealth: which is the predictor and which is the response?

67 Bivariate Normal Distribution  p.d.f. of (X, Y): f(x, y) = 1/(2π σ_X σ_Y √(1 - ρ²)) · exp{ -1/[2(1 - ρ²)] · [ ((x - μ_X)/σ_X)² - 2ρ((x - μ_X)/σ_X)((y - μ_Y)/σ_Y) + ((y - μ_Y)/σ_Y)² ] }.  Properties: the p.d.f. is defined for -1 < ρ < 1; it is undefined if ρ = ±1, and the distribution is then called degenerate.  The marginal p.d.f. of X is N(μ_X, σ_X²).  The marginal p.d.f. of Y is N(μ_Y, σ_Y²).


69 How to calculate ρ  Let (X, Y) have the bivariate normal p.d.f. above.  The correlation coefficient is ρ = Corr(X, Y) = Cov(X, Y)/(σ_X σ_Y), where Cov(X, Y) = E[(X - μ_X)(Y - μ_Y)].

70 Calculation  Treating the bivariate normal as the N = 2 case of the multivariate normal and integrating against the joint p.d.f., the covariance works out to Cov(X, Y) = ρ σ_X σ_Y.

71 Calculation (cont…)  Therefore Corr(X, Y) = ρ σ_X σ_Y / (σ_X σ_Y) = ρ: the parameter ρ appearing in the bivariate normal p.d.f. is exactly the correlation coefficient of X and Y.

72 Statistical Inference on the Correlation Coefficient ρ  We can derive a test on the correlation coefficient in the same way that we have been doing in class.  Assumptions: (X, Y) are from the bivariate normal distribution.  Start with the point estimator: R = Sxy/√(Sxx·Syy), the sample estimate of the population correlation coefficient ρ.  Get the pivotal quantity: the distribution of R itself is quite complicated, so we transform the point estimator into a p.q. T.  Do we know everything about the p.q.?  Yes: T ~ t_{n-2} under H₀: ρ = 0.

73 Derivation of T  Using β̂₁ = r·√(Syy/Sxx) and s² = Syy(1 - r²)/(n - 2), the t statistic for the slope can be rewritten as t = β̂₁/(s/√Sxx) = r√(n - 2)/√(1 - r²).  Therefore we can use t as a statistic for testing against the null hypothesis H₀: β₁ = 0.  Equivalently, we can test against H₀: ρ = 0.

74 Exact Statistical Inference on ρ  Test H₀: ρ = 0 vs. Hₐ: ρ ≠ 0.  Test statistic: t₀ = r√(n - 2)/√(1 - r²), which has a t_{n-2} distribution under H₀.

75 Exact Statistical Inference on ρ (Cont.)  Rejection region: reject H₀ if |t₀| > t_{n-2, α/2}.  Example: the delivery volume (y, number of cases) and delivery time (x, minutes) for 25 soft drink deliveries are shown in the table on the next page.  Test the null hypothesis that the correlation coefficient is equal to 0.

76 Exact Statistical Inference on ρ -- Data (y = delivery volume, x = delivery time)
 y    x        y    x        y    x        y    x        y    x
 7   16.68     7   18.11    16   40.33    10   29.00    10   17.90
 3   11.50     2    8.00    10   21.00     6   15.35    26   52.32
 3   12.03     7   17.83     4   13.50     7   19.00     9   18.75
 4   14.88    30   79.24     6   19.75     3    9.50     8   19.83
 6   13.75     5   21.50     9   24.00    17   35.10     4   10.75

77 Exact Statistical Inference on ρ -- Solution  The sample correlation coefficient is r ≈ 0.965, so t₀ = r√(n - 2)/√(1 - r²) ≈ 0.965·√23/√(1 - 0.931) ≈ 17.6.  For α = .01, t_{23, .005} = 2.807; since |t₀| > 2.807, reject H₀.

78 Approximate Statistical Inference on ρ  There is no exact method of testing ρ against an arbitrary ρ₀: the distribution of R is very complicated, and T ~ t_{n-2} only when ρ = 0.  To test ρ against an arbitrary ρ₀, use Fisher's normal approximation.  Transform the sample estimate: ψ̂ = ½ ln[(1 + r)/(1 - r)]; then ψ̂ is approximately N(ψ, 1/(n - 3)), where ψ = ½ ln[(1 + ρ)/(1 - ρ)].

79 Approximate Statistical Inference on ρ  Sample estimator: ψ̂ = ½ ln[(1 + r)/(1 - r)].  100(1 - α)% CI for ψ: ψ̂ ± z_{α/2}/√(n - 3); back-transforming via ρ = (e^(2ψ) - 1)/(e^(2ψ) + 1) gives a CI for ρ.  Test of H₀: ρ = ρ₀ vs. H₁: ρ ≠ ρ₀: z statistic z₀ = (ψ̂ - ψ₀)√(n - 3), where ψ₀ = ½ ln[(1 + ρ₀)/(1 - ρ₀)]; reject H₀ if |z₀| > z_{α/2}.
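Applied to the delivery example (using r ≈ 0.965 and n = 25; the numbers below are approximate):

ψ̂ = ½ ln[(1 + 0.965)/(1 - 0.965)] ≈ 2.01
95% CI for ψ: 2.01 ± 1.96/√22 ≈ (1.59, 2.43)
Back-transforming with ρ = (e^(2ψ) - 1)/(e^(2ψ) + 1) gives approximately 0.92 ≤ ρ ≤ 0.98.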

80 Approximate Statistical Inference on ρ -- Code (see the SAS sketch below)
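The code on the original slide is not preserved in this transcript; one way to carry out the Fisher-transformation inference in SAS is the FISHER option of PROC CORR (a sketch; it assumes the delivery data have been read into the dataset corr_bev defined in the slide 83 code):

proc corr data=corr_bev fisher(biasadj=no);
  var x y;   /* prints r, Fisher's z, an approximate 95% CI for rho, and the p-value for H0: rho = 0 */
run;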

81 Approximate Statistical Inference on ρ -- Output (the SAS output shown on this slide is not reproduced in the transcript)

82 Approximate Statistical Inference on ρ  Retaking the previous example: the delivery volume (y) and delivery time (x) for the 25 soft drink deliveries shown in the earlier table.  Test the null hypothesis that the correlation coefficient is equal to 0.

83 SAS code for the last example

data corr_bev;
  input y x;
  datalines;
7 16.68
3 11.5
3 12.03
4 14.88
6 13.75
7 18.11
2 8.00
7 17.83
30 79.24
5 21.5
16 40.33
10 21.00
4 13.5
6 19.75
9 24.00
10 29.00
6 15.35
7 19.00
3 9.50
17 35.1
10 17.90
26 52.32
9 18.75
8 19.83
4 10.75
;
run;

proc gplot data=corr_bev;
  plot y*x;
run;

proc corr data=corr_bev outp=corr;
  var x y;
run;

84-85 SAS analysis for the last example (the scatter plot and PROC CORR output shown on these slides are not reproduced in the transcript)

86 Pitfalls of Regression and Correlation Analysis  Correlation and causation: e.g., concluding that a good mood causes good health.  Coincidental data: e.g., a spurious association between baldness and being a lawyer.  Lurking variables (a third, unobserved variable): e.g., the relationship between eating and weight, with unobserved variables such as heredity, metabolism, and illness.  Restricted range: e.g., IQ and school performance from elementary school to college; in college, lower IQs are less common, so the range of IQ (and hence the observed correlation) is clearly reduced.

87 Pitfalls of Regression and Correlation Analysis  Correlation and linearity: the correlation value alone may not be enough to evaluate a relationship, especially when the assumptions (such as linearity or normality) do not hold.  The four data sets in this figure, constructed by Francis Anscombe, share a common y mean (7.50), y variance (about 4.12), correlation (0.81), and fitted regression line y = 3 + 0.5x, yet their scatter plots are completely different.

88 SIMPLE LINEAR REGRESSION AND CORRELATION Prepared by: Jackie Zerle David Fried Chun-Hui Chung Weilai Zhou Shiyhan Zhang Alex Fields Yu-Hsun Cheng Roosevelt Moreno AMS 572.1 DATA ANALYSIS, FALL 2007.

