
1 Simple Linear Regression AMS 572 11/29/2010

2 Outline
1. Brief History and Motivation – Zhen Gong
2. Simple Linear Regression Model – Wenxiang Liu
3. Ordinary Least Squares Method – Ziyan Lou
4. Goodness of Fit of LS Line – Yixing Feng
5. OLS Example – Lingbin Jin
6. Statistical Inference on Parameters – Letan Lin
7. Statistical Inference Example – Emily Vo
8. Regression Diagnostics – Yang Liu
9. Correlation Analysis – Andrew Candela
10. Implementation in SAS – Joseph Chisari

3 Brief History and Introduction
Legendre published the earliest form of regression, the method of least squares, in 1805. In 1809, Gauss published the same method. Francis Galton extended the method in the 19th century to describe a biological phenomenon, and Karl Pearson and Udny Yule extended it to a more general statistical context around the turn of the 20th century.

4 Motivation for Regression Analysis
Regression analysis is a statistical methodology for estimating the relationship of a response variable to a set of predictor variables. When there is just one predictor variable, we use simple linear regression; when there are two or more predictor variables, we use multiple linear regression. Given a new observed predictor value, the fitted model yields a prediction for the response variable: we predict Y based on X.

5 Motivation for Regression Analysis
2010 Camry: Horsepower at 6000 rpm: 169; Highway gasoline consumption: 0.03125 gallon per mile
2010 Milan: Horsepower at 6000 rpm: 175; Highway gasoline consumption: 0.0326 gallon per mile
2010 Fusion: Horsepower at 6000 rpm: 263; Highway gasoline consumption: ?
Response variable (Y): Highway gasoline consumption
Predictor variable (X): Horsepower at 6000 rpm

6 Simple Linear Regression Model
A summary of the relationship between a dependent variable (or response variable) Y and an independent variable (or covariate) X. Y is assumed to be a random variable, while X, even if it is a random variable, is conditioned on (assumed fixed). Essentially, we are interested in the behavior of Y given that we know X = x.

7 Good Model
Regression models attempt to minimize the distance, measured vertically, between each observation point and the model line (or curve). The length of this line segment is called the residual, modeling error, or simply error. Requiring only that the negative and positive errors cancel out (zero overall error) is not enough: many lines satisfy this criterion.

8 Good Model

9 Probabilistic Model
In simple linear regression, the population regression line is given by E(Y) = β₀ + β₁x. The actual values of Y are assumed to be the sum of the mean value, E(Y), and a random error term ε:
Y = E(Y) + ε = β₀ + β₁x + ε
At any given value of x, the dependent variable Y ~ N(β₀ + β₁x, σ²).
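As an illustration, here is a minimal SAS sketch that simulates data from this probabilistic model; the parameter values β₀ = 2, β₁ = 0.5, and σ = 1 are hypothetical, chosen only for the example.
/* Simulate Y = beta0 + beta1*x + epsilon, with epsilon ~ N(0, sigma^2).
   The parameter values below are hypothetical. */
data sim_slr;
  call streaminit(572);                 /* seed for reproducibility */
  beta0 = 2; beta1 = 0.5; sigma = 1;
  do x = 1 to 20;
    y = beta0 + beta1*x + rand('Normal', 0, sigma);
    output;
  end;
  keep x y;
run;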

10 Least Squares (LS) Fit: Boiling Point of Water in the Alps data

11 Least Squares (LS) Fit
Find a line that represents the "best" linear relationship:
y = β₀ + β₁x

12 Least Squares (LS) Fit
Problem: the data do not all fall on a single line. We are looking for the line that minimizes the sum of squared vertical deviations:
Q = Σᵢ [yᵢ − (β₀ + β₁xᵢ)]²

13 Least Squares (LS) Fit
To find the parameter values that minimize the sum of squared differences, take the partial derivative with respect to each parameter and set it equal to zero:
∂Q/∂β₀ = −2 Σᵢ (yᵢ − β₀ − β₁xᵢ) = 0
∂Q/∂β₁ = −2 Σᵢ xᵢ (yᵢ − β₀ − β₁xᵢ) = 0

14 Least Squares (LS) Fit
Solving these equations, we get
β̂₁ = (Σᵢ xᵢyᵢ − n x̄ȳ) / (Σᵢ xᵢ² − n x̄²)
β̂₀ = ȳ − β̂₁x̄

15 Least Squares (LS) Fit
To simplify, we introduce
Sxx = Σᵢ (xᵢ − x̄)² and Sxy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
so that β̂₁ = Sxy/Sxx and β̂₀ = ȳ − β̂₁x̄. The resulting equation
ŷ = β̂₀ + β̂₁x
is known as the least squares line, which is an estimate of the true regression line.
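A minimal SAS/IML sketch of these formulas (assuming SAS/IML is available); the five (Last, Next) pairs are the first five observations of the Old Faithful data introduced in the OLS example below.
proc iml;
  /* first five (Last, Next) pairs from the Old Faithful example */
  x = {2.0, 1.8, 3.7, 2.2, 2.1};
  y = {50, 57, 55, 47, 53};
  xbar = mean(x);  ybar = mean(y);
  Sxx = sum((x - xbar)##2);             /* sum of squares of x */
  Sxy = sum((x - xbar)#(y - ybar));     /* sum of cross products */
  beta1 = Sxy / Sxx;                    /* LS slope estimate */
  beta0 = ybar - beta1*xbar;            /* LS intercept estimate */
  print beta0 beta1;
quit;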

16 Goodness of Fit of the LS Line
The fitted values are ŷᵢ = β̂₀ + β̂₁xᵢ. The residuals eᵢ = yᵢ − ŷᵢ are used to evaluate the goodness of fit of the LS line.

17 Goodness of Fit of the LS Line
The error sum of squares: SSE = Σᵢ (yᵢ − ŷᵢ)²
The total sum of squares: SST = Σᵢ (yᵢ − ȳ)²
The regression sum of squares: SSR = Σᵢ (ŷᵢ − ȳ)²
These satisfy the identity SST = SSR + SSE.

18 Goodness of Fit of the LS Line
The coefficient of determination, r² = SSR/SST = 1 − SSE/SST, is always between 0 and 1. The sample correlation coefficient between X and Y is
r = Sxy / √(Sxx Syy)
For simple linear regression, r² equals the coefficient of determination.

19 Estimation of the variance σ²
The variance σ² measures the scatter of the yᵢ around their means. An unbiased estimate of σ² is given by
s² = SSE/(n − 2) = Σᵢ eᵢ²/(n − 2)
This estimate of σ² has n − 2 degrees of freedom.

20 Implementing the OLS Method on Problem 10.4
The time between eruptions of Old Faithful geyser in Yellowstone National Park is random but is related to the duration of the last eruption. The table below shows these times for 21 consecutive eruptions.
Obs  Last  Next | Obs  Last  Next | Obs  Last  Next
1    2.0   50   | 8    2.8   57   | 15   4.0   77
2    1.8   57   | 9    3.3   72   | 16   4.0   70
3    3.7   55   | 10   3.5   62   | 17   1.7   43
4    2.2   47   | 11   3.7   63   | 18   1.8   48
5    2.1   53   | 12   3.8   70   | 19   4.9   70
6    2.4   50   | 13   4.5   85   | 20   4.2   79
7    2.6   62   | 14   4.7   75   | 21   4.3   72
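For later reference, here is a sketch of a data step that builds the Regression_Example data set (the name used by the PROC REG code at the end of these slides) from this table.
data Regression_Example;
  input Last Next @@;   /* @@ reads several (Last, Next) pairs per line */
  datalines;
2.0 50  1.8 57  3.7 55  2.2 47  2.1 53  2.4 50  2.6 62
2.8 57  3.3 72  3.5 62  3.7 63  3.8 70  4.5 85  4.7 75
4.0 77  4.0 70  1.7 43  1.8 48  4.9 70  4.2 79  4.3 72
;
run;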

21 Implementing the OLS Method on Problem 10.4: a scatter plot of NEXT vs. LAST

22 Implementing the OLS Method on Problem 10.4
From the data: n = 21, x̄ = 68.0/21 ≈ 3.238, ȳ = 1317/21 ≈ 62.714,
Sxx = Σᵢ xᵢ² − n x̄² ≈ 22.23, Sxy = Σᵢ xᵢyᵢ − n x̄ȳ ≈ 217.63
β̂₁ = Sxy/Sxx ≈ 9.79, β̂₀ = ȳ − β̂₁x̄ ≈ 31.01
The LS line is ŷ = 31.01 + 9.79x.

23 Implementing the OLS Method on Problem 10.4
When x = 3, the predicted value is ŷ = 31.01 + 9.79(3) ≈ 60.4. We could say that LAST is a good predictor of NEXT.

24 Statistical Inference on β₀ and β₁
Final result: β̂₀ and β̂₁ are normally distributed:
β̂₁ ~ N(β₁, σ²/Sxx) and β̂₀ ~ N(β₀, σ² Σᵢ xᵢ² / (n Sxx))

25 Statistical Inference on β₀ and β₁
Derivation: treat the xᵢ's as fixed and write β̂₁ as a linear combination of the Yᵢ:
β̂₁ = Sxy/Sxx = Σᵢ cᵢYᵢ, where cᵢ = (xᵢ − x̄)/Sxx

26 Statistical Inference on β₀ and β₁
Derivation (continued): since Σᵢ cᵢ = 0 and Σᵢ cᵢxᵢ = 1,
E(β̂₁) = Σᵢ cᵢ(β₀ + β₁xᵢ) = β₁
so β̂₁ is an unbiased estimator of β₁.

27 Statistical Inference on β₀ and β₁
Derivation (continued): because the Yᵢ are independent with Var(Yᵢ) = σ²,
Var(β̂₁) = σ² Σᵢ cᵢ² = σ²/Sxx
and, being a linear combination of independent normal random variables, β̂₁ is itself normal. A similar argument gives the distribution of β̂₀ = Ȳ − β̂₁x̄.

28 Statistical Inference on β₀ and β₁
Since SSE/σ² has a chi-square distribution with n − 2 d.f., independent of β̂₀ and β̂₁, we obtain the pivotal quantities (P.Q.):
(β̂₀ − β₀)/SE(β̂₀) ~ t(n − 2), where SE(β̂₀) = s √(Σᵢ xᵢ² / (n Sxx))
(β̂₁ − β₁)/SE(β̂₁) ~ t(n − 2), where SE(β̂₁) = s / √Sxx
Confidence intervals (CI's):
β̂₀ ± t(n − 2, α/2) SE(β̂₀) and β̂₁ ± t(n − 2, α/2) SE(β̂₁)
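As a worked illustration, using values computed from the Problem 10.4 data (so the numbers are approximate): with β̂₁ ≈ 9.79, s ≈ 6.13, and Sxx ≈ 22.23,
SE(β̂₁) = 6.13/√22.23 ≈ 1.30
and a 95% CI for β₁ is 9.79 ± 2.093 × 1.30 ≈ [7.07, 12.51].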

29 Statistical Inference on β₀ and β₁
Hypothesis tests:
H₀: β₀ = β₀⁰ vs. H₁: β₀ ≠ β₀⁰: reject H₀ at level α if |β̂₀ − β₀⁰|/SE(β̂₀) > t(n − 2, α/2)
H₀: β₁ = β₁⁰ vs. H₁: β₁ ≠ β₁⁰: reject H₀ at level α if |β̂₁ − β₁⁰|/SE(β̂₁) > t(n − 2, α/2)
A useful application is testing H₀: β₁ = 0, which shows whether there is a linear relationship between x and y.
One-sided alternative hypotheses can be tested using one-sided t-tests.

30 Analysis of Variance (ANOVA)
Mean Square: a sum of squares divided by its degrees of freedom.

31 Analysis of Variance (ANOVA)
ANOVA Table:
Source of Variation (Source) | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS)  | F
Regression                   | SSR                 | 1                         | MSR = SSR/1       | F = MSR/MSE
Error                        | SSE                 | n - 2                     | MSE = SSE/(n - 2) |
Total                        | SST                 | n - 1                     |                   |
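Filled in with values computed from the Problem 10.4 data (approximate): SSR ≈ 2130.6, SSE ≈ 713.7, SST ≈ 2844.3, so MSR = 2130.6, MSE = 713.7/19 ≈ 37.56, and F = MSR/MSE ≈ 56.7. Note that F = t² here (7.53² ≈ 56.7), as expected in simple linear regression.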

32 Statistical Inference Example – Testing for Linear Relationship
Problem 10.4: at α = 0.05, is there a linear trend between the time to the NEXT eruption and the duration of the LAST eruption?
H₀: β₁ = 0 vs. H₁: β₁ ≠ 0
Reject H₀ if |t| > t(n − 2, α/2) = t(19, .025) = 2.093, where t = β̂₁/SE(β̂₁).

33 Statistical Inference – Hypothesis Testing
Solution: t = β̂₁/SE(β̂₁) ≈ 9.79/1.30 ≈ 7.53 > 2.093.
We reject H₀ and therefore conclude that there is a linear relationship between NEXT and LAST.

34 Statistical Inference Example - Confidence and Prediction Intervals
Problem 10.11 from Tamhane & Dunlop, Statistics and Data Analysis:
10.11(a) Calculate a 95% PI for the time to the next eruption if the last eruption lasted 3 minutes.

35 Problem 10.11 – Prediction Interval
Solution: the formula for a 100(1 − α)% PI for a future observation Y* at x = x* is
ŷ* ± t(n − 2, α/2) · s √(1 + 1/n + (x* − x̄)²/Sxx)

36 Problem 10.11 - Prediction Interval
With x* = 3: ŷ* = 31.01 + 9.79(3) ≈ 60.38, and
60.38 ± 2.093 × 6.13 × √(1 + 1/21 + (3 − 3.238)²/22.23) ≈ 60.38 ± 13.15
The 95% PI is approximately [47.2, 73.5].

37 Problem 10.11 - Confidence Interval
10.11(b) Calculate a 95% CI for the mean time to the next eruption for a last eruption lasting 3 minutes. Compare this confidence interval with the PI obtained in (a).

38 Problem 10.11 - Confidence Interval
Solution: the formula for a 100(1 − α)% CI for the mean μ* = E(Y | x = x*) is
ŷ* ± t(n − 2, α/2) · s √(1/n + (x* − x̄)²/Sxx)
The 95% CI is [57.510, 63.257]. The CI is shorter than the PI, since it accounts only for the uncertainty in estimating the mean, not the scatter of an individual observation.
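A sketch of how these intervals could be obtained in SAS: append an observation with Last = 3 and Next left missing, then request CLM (CI for the mean) and CLI (PI) on the MODEL statement; Regression_Example is the data set built earlier.
data to_predict;
  Last = 3; output;                   /* Next is intentionally missing */
run;
data with_pred;
  set Regression_Example to_predict;
run;
proc reg data=with_pred;
  model Next = Last / clm cli alpha=0.05;
run;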

39 Regression Diagnostics
Checking the model assumptions:
1. E(Y) is a linear function of x
2. Var(Y) = σ² is the same for all x
3. The errors are normally distributed
4. The errors are independent (for time series data)
Checking for outliers and influential observations.

40 Checking the Model Assumptions
Residuals: eᵢ = yᵢ − ŷᵢ can be viewed as "estimates" of the random errors εᵢ.

41 Checking for Linearity
If the regression of Y on x is linear, then the plot of the residuals eᵢ vs. the xᵢ should exhibit random scatter around zero.

42 Checking for Linearity: Tire Wear Data
Obs  x    y       ŷ       e
1    0    394.33  360.64   33.69
2    4    329.50  331.51   -2.01
3    8    291.00  302.39  -11.39
4    12   255.17  273.27  -18.10
5    16   229.33  244.15  -14.82
6    20   204.83  215.02  -10.19
7    24   179.00  185.90   -6.90
8    28   163.83  156.78    7.05
9    32   150.33  127.66   22.67

43 Checking for Linearity
Plotting the residuals e vs. x for the tire wear data above shows a systematic pattern (positive at the ends, negative in the middle) rather than random scatter, indicating that the straight-line fit is inadequate.

44 Checking for Linearity: Data Transformation
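A sketch of one common remedy, assuming (as the curved residual pattern suggests) an exponential-decay trend in the tire wear data: transform y to log y and refit a straight line. The data set and variable names here are illustrative, and the choice of transformation is an assumption.
data tire_wear;
  input x y @@;
  log_y = log(y);   /* log transform to straighten an exponential trend */
  datalines;
0 394.33  4 329.50  8 291.00  12 255.17  16 229.33
20 204.83  24 179.00  28 163.83  32 150.33
;
run;
proc reg data=tire_wear;
  model log_y = x;
run;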

45 Checking for Constant Variance
If the constant variance assumption is correct, the dispersion of the residuals eᵢ is approximately constant with respect to the fitted values ŷᵢ.

46 Checking for Constant Variance
Example from textbook 10.21: plot of the residuals e against the fitted values.

47 Checking for Normality
We can use the residuals to make a normal plot. Example from textbook 10.21: normal plot of residuals.

48 Checking for Outliers
Definition: an outlier is an observation that does not follow the general pattern of the relationship between x and y. A large residual indicates an outlier!

49 Checking for Influential Observations
An observation can be influential because it has an extreme x-value, an extreme y-value, or both. A large leverage hᵢᵢ indicates an influential observation! A common rule flags observations with
hᵢᵢ > 2(k + 1)/n, where k = # of predictors
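A sketch of how the leverages could be inspected in SAS for the Old Faithful fit (k = 1 and n = 21, so the cutoff is 2 × 2/21 ≈ 0.19); the output data set and variable names are illustrative.
proc reg data=Regression_Example;
  model Next = Last;
  output out=diagnostics h=leverage r=residual;  /* h= requests leverages */
run;
proc print data=diagnostics;
  where leverage > 2*2/21;   /* flag high-leverage observations */
run;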

50 Checking for Influential Observations

51 Why use Correlation analysis?
If the nature of the relationship between X and Y is not known, we can investigate the correlation between them without making any assumption of causality. In order to do this, assume (X, Y) follows the bivariate normal distribution.

52 The Bivariate Normal Distribution
(X, Y) has the joint density
f(x, y) = 1 / (2π σX σY √(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) [ ((x − μX)/σX)² − 2ρ ((x − μX)/σX)((y − μY)/σY) + ((y − μY)/σY)² ] }

53 Why can we do this?
This assumption reduces to the probabilistic model for linear regression, since the conditional distribution of Y given X = x is normal with the following parameters:
E(Y | x) = μY + ρ(σY/σX)(x − μX), Var(Y | x) = σY²(1 − ρ²)
So when X = x, the mean of Y is a linear function of x and the variance is constant w.r.t. x.

54 So what?
Under these assumptions we can use the data available to make inferences about ρ. First we have to estimate ρ from the data. Define the sample correlation coefficient R:
R = Sxy / √(Sxx Syy) = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / √( Σᵢ (Xᵢ − X̄)² · Σᵢ (Yᵢ − Ȳ)² )

55 How can we use this?
The exact distribution of R is very complicated, but we do have some options. Under the null hypothesis H₀: ρ = 0 the distribution of R simplifies, and an exact test exists in this case. For arbitrary values of ρ₀ we can approximate a function of R with a normal distribution, thanks to R. A. Fisher.

56 Testing H₀: ρ = 0
Under H₀, the statistic R√(n − 2)/√(1 − R²) has a t(n − 2) distribution. This is somewhat surprising, but think about it: the test statistic we used to test β₁ = 0 is also distributed as t(n − 2), and ρ = 0 if and only if β₁ = 0. That the two test statistics are equivalent is shown on pages 382-383 of the text.
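A worked check, using values computed from the Problem 10.4 data (approximate): r² = SSR/SST ≈ 2130.6/2844.3 ≈ 0.749, so r ≈ 0.866 and
t = 0.866 √19 / √(1 − 0.749) ≈ 7.53
which matches the t statistic for testing β₁ = 0 obtained earlier.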

57 Approximation of R
Fisher showed that, for n as small as 10,
tanh⁻¹(R) = (1/2) ln[(1 + R)/(1 − R)] is approximately N( (1/2) ln[(1 + ρ)/(1 − ρ)], 1/(n − 3) )
Now we can test H₀: ρ = ρ₀ vs. H₁: ρ ≠ ρ₀ for arbitrary ρ₀. We just compute
z = √(n − 3) [ tanh⁻¹(r) − tanh⁻¹(ρ₀) ]
and compare it with the standard normal distribution.

58 Almost Finished!
We now have the tools necessary for inference on ρ. For a confidence interval for ρ, compute the interval for tanh⁻¹(ρ):
tanh⁻¹(r) ± z(α/2) / √(n − 3)
and solve for ρ by applying the tanh function to both endpoints:
[ tanh( tanh⁻¹(r) − z(α/2)/√(n − 3) ), tanh( tanh⁻¹(r) + z(α/2)/√(n − 3) ) ]
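A sketch of how this could be done in SAS: the FISHER option of PROC CORR reports the Fisher z confidence interval for ρ and the test of H₀: ρ = ρ₀ (ρ₀ = 0.5 below is just an illustrative value).
proc corr data=Regression_Example fisher(rho0=0.5 alpha=0.05);
  var Last Next;
run;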

59 Correlation - Conclusion
When we are not sure of the relationship between X and Y, assume (Xᵢ, Yᵢ) is an observation from a bivariate normal distribution. To test H₀: ρ = ρ₀ vs. H₁: ρ ≠ ρ₀ at significance level α, just compare
|z| = √(n − 3) |tanh⁻¹(r) − tanh⁻¹(ρ₀)| to z(α/2)
But if ρ₀ = 0, compare t = r√(n − 2)/√(1 − r²) to t(n − 2, α/2).

60 SAS - Reg Procedure
Proc Reg Data=Regression_Example;
  Title "Regression Example";
  Model Next = Last;
  Plot Next*Last;
  Plot Residual.*Predicted.;
  Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;

61 Proc Reg Output

62 Plot Next*Last

63 SAS - Plotting Regression Line
Symbol1 Value=Dot C=blue I=RL;
Symbol2 Value=None C=red I=RLCLM95;
Proc Gplot Data=Regression_Example;
  Title "Regression Line and CIs";
  Plot Next*Last=1 Next*Last=2 / Overlay;
Run;

64 Plotting Regression Line

65 SAS - Checking Homoscedasticity
The same PROC REG call as before; the statement that matters here is the residual-versus-predicted plot:
Proc Reg Data=Regression_Example;
  Title "Regression Example";
  Model Next = Last;
  Plot Next*Last;
  Plot Residual.*Predicted.;
  Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;

66 Predicted.*Residual.

67 SAS - Checking Normality of Residuals
Proc Reg Data=Regression_Example;
  Model Next = Last;
  Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;
Proc Univariate Data=Data_From_Regression Normal;
  Var R;
  QQplot R / Normal(Mu=est Sigma=est);
Run;

68 Checking for Normality

69 Questions?

