Presentation on theme: "Hypothesis Testing in Linear Regression Analysis"— Presentation transcript:
1 Chapter 5: Hypothesis Testing in Linear Regression Analysis
2 Learning Objectives Construct sampling distributions. Understand desirable properties of estimators. Understand the simple linear regression assumptions required for OLS to be the best linear unbiased estimator. Understand how to conduct hypothesis tests in linear regression analysis. Conduct hypothesis tests for the overall statistical significance of the estimated sample regression function. Conduct hypothesis tests for the statistical significance of the estimated slope coefficient. Understand how to read regression output in Excel. Construct confidence intervals around the predicted value of y.
4 Understand the Goals of Hypothesis Testing Goal: To gain insight into specific parameters that exist for a population. Problem: Because the population is unobserved, the specific population parameters are unobserved. Approach: Draw a specific random sample from the population and observe sample statistics for that sample. Use those observed sample statistics as the best guess of the unobserved population parameters. Ask whether it is likely to observe the sample statistics actually observed if the unobserved population parameter equals a hypothesized value.
5 Construct Sampling Distributions A sampling distribution is the distribution of a sample statistic based on random sampling. To construct a sampling distribution for the estimated slope coefficient, we would draw many, many samples from the population, estimate the slope in each sample, and then construct a relative frequency histogram of the estimates.
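The procedure described on this slide can be sketched as a small simulation. This is only an illustration under stated assumptions: the population model y = 2 + 0.5x + error, the sample size of 30, and the 1,000 repetitions are hypothetical choices, not values from the slides.

```python
import random
import statistics


def ols_slope(x, y):
    """Return the OLS slope estimate from regressing y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(n_samples=1000, n=30, beta0=2.0, beta1=0.5, seed=0):
    """Draw n_samples random samples of size n from a hypothetical
    population y = beta0 + beta1*x + error and estimate the slope in each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        x = [rng.uniform(0, 10) for _ in range(n)]
        y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes


def relative_frequencies(values, n_bins=10):
    """Return (bin_start, relative_frequency) pairs for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # largest value goes in the last bin
        counts[idx] += 1
    return [(lo + i * width, c / len(values)) for i, c in enumerate(counts)]


slopes = simulate_slopes()
# Text version of a relative frequency histogram of the estimated slopes.
for start, freq in relative_frequencies(slopes):
    print(f"{start:6.3f}  {'#' * round(freq * 100)}  {freq:.3f}")
print("mean of estimated slopes:", round(statistics.fmean(slopes), 3))
```

The histogram should be centered near the assumed population slope of 0.5, which is what the sampling distribution of an unbiased estimator looks like.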
6 Relative Frequency Histogram of 100 Different Slope Coefficients
7 Understand Desirable Properties of Simple Linear Regression Estimators Two desirable properties of estimators are: Unbiased: the average value of all possible estimates equals the true population value, or \(E(\hat{\beta}) = \beta\). Efficient: if two estimators are unbiased, then one estimator is more efficient than the other if it has a lower variance.
8 Visual Depiction of the Sampling Distribution of an Unbiased Estimator The expected value of the estimator is the average value of all possible values of the estimate from samples of size n from the population.
9 What does the Term Unbiased Mean? An estimator is unbiased if it equals the true population parameter on average. If an estimator is unbiased, it does not mean that for one particular sample the estimated value equals the true population parameter. If an estimator is not unbiased, then it is biased.
12 What does the Term Efficient Mean? If two estimators are both unbiased, then one estimator is more efficient than the other if it has a lower variance. A more efficient estimator is preferred to a less efficient estimator because it does a better job of estimating the true population parameter (extreme values are less likely).
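A simulation can make the comparison concrete. The sketch below assumes a hypothetical population y = 2 + 0.5x + error with fixed x values 1 through 20, and compares OLS with an "endpoint" estimator that uses only the first and last observations. Both are unbiased in this setup, but they differ in variance. The endpoint estimator is introduced here purely for illustration.

```python
import random
import statistics

X = list(range(1, 21))      # fixed values of the independent variable (assumed)
BETA0, BETA1 = 2.0, 0.5     # hypothetical population parameters


def ols_slope(x, y):
    """OLS slope estimate from regressing y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def endpoint_slope(x, y):
    """Slope of the line through the first and last observations only."""
    return (y[-1] - y[0]) / (x[-1] - x[0])


def compare_estimators(n_samples=2000, seed=1):
    """Estimate the slope both ways in each of n_samples simulated samples."""
    rng = random.Random(seed)
    ols, endpoint = [], []
    for _ in range(n_samples):
        y = [BETA0 + BETA1 * xi + rng.gauss(0, 1) for xi in X]
        ols.append(ols_slope(X, y))
        endpoint.append(endpoint_slope(X, y))
    return ols, endpoint


ols, endpoint = compare_estimators()
print("mean (OLS, endpoint):    ", statistics.fmean(ols), statistics.fmean(endpoint))
print("variance (OLS, endpoint):", statistics.variance(ols), statistics.variance(endpoint))
```

Both averages should land close to 0.5, while the OLS estimates are more tightly clustered, which is what "more efficient" means here.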
13 Understand the Simple Linear Regression Assumptions Required for OLS to be the Best Linear Unbiased Estimator Assumptions Required for OLS to be Unbiased: Assumption S1: The model is linear in the parameters. Assumption S2: The data are collected through independent, random sampling. Assumption S3: There must be sample variation in the independent variable. Assumption S4: The error term has zero mean. Assumption S5: The error term is uncorrelated with the independent variable and all functions of the independent variable. Additional Assumption Required for OLS to be BLUE: Assumption S6: The error term has constant variance. Note that these assumptions are theoretical and typically can’t be proven or disproven.
14 Assumption S1: Linear in the Parameters This assumption states that for OLS to be unbiased, the population model must be correctly specified as linear in the parameters, or \(y = \beta_0 + \beta_1 x + \varepsilon\).
15 When is Assumption S1 Violated? If the population regression model is non-linear in the parameters. If the true population model is not specified correctly, i.e. if the true model differs from the model on the previous slide but the model on the previous slide is the one that is estimated.
16 Assumption S2: The Data are Collected through Simple Random Sampling This assumption states that for OLS to be unbiased, the data must be obtained through simple random sampling. This assumption ensures that the observations are statistically independent of each other across the units of observation.
17 When is Assumption S2 Violated? If the data are time series data, such as GDP and interest rates for the US collected over time. In this circumstance, observations from one time period are likely related to observations in previous time periods. If there is some type of selection bias in the sampling. For example, individuals may opt into a job training program or choose to go to college, or the response rate for a survey may be low.
18 Assumption S3: There Must be Sample Variation in the Independent Variables This assumption states that for OLS to be unbiased, the values of the independent variable cannot all be the same, or \(\sum_{i=1}^{n}(x_i - \bar{x})^2 \neq 0\). This assumption ensures that the slope estimator is defined. If there is no sample variation in the independent variable, then \(\hat{\beta}_1\) will be undefined because its denominator is zero. This assumption is almost never violated.
20 Assumption S4: The Error Term has Zero Mean This assumption states that for OLS to be unbiased, the average value of the population error term is zero, or \(E(\varepsilon) = 0\). This assumption will hold as long as an intercept is included in the model. This is because if the average value of the error term equals a value other than zero, then the intercept will change accordingly.
21 Assumption S5: The Error Term is Not Correlated with the Independent Variable or Any Function of the Independent Variable This assumption states that for OLS to be unbiased, the error term is uncorrelated with the independent variable and all functions of the independent variable, or \(E(\varepsilon \mid x) = 0\). This is read as the expected value of ε given x is equal to 0.
22 How to Determine if Assumption S5 is Violated? Think of all the factors that affect the dependent variable that are not specified in the model. For the salary vs. education example, variables that are in the error term include experience, ability, job type, gender, and many other factors. If any of these factors, say ability, is related to the independent variable, education, then assumption S5 is violated. Note that the error term is never observed, so determining whether S5 is violated is only a thought experiment.
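The thought experiment can be mimicked in a simulation where the error term is, for once, observable. The sketch below assumes a hypothetical population in which ability raises both education and salary. All numbers (a true education effect of 3, an ability effect of 5, and the noise levels) are invented for illustration. Regressing salary on education alone then leaves ability in the error term, correlated with education.

```python
import random
import statistics

TRUE_EDUCATION_EFFECT = 3.0   # hypothetical effect of one more year of education
ABILITY_EFFECT = 5.0          # hypothetical effect of ability (omitted from the model)


def ols_slope(x, y):
    """OLS slope estimate from regressing y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def one_sample_slope(rng, n=200):
    """Simulate one sample and regress salary on education only."""
    education, salary = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        edu = 14 + 1.5 * ability + rng.gauss(0, 1.5)   # ability is related to education
        sal = (10 + TRUE_EDUCATION_EFFECT * edu
               + ABILITY_EFFECT * ability + rng.gauss(0, 4))
        education.append(edu)
        salary.append(sal)
    return ols_slope(education, salary)


rng = random.Random(2)
slopes = [one_sample_slope(rng) for _ in range(500)]
average_slope = statistics.fmean(slopes)
print("true effect of education:", TRUE_EDUCATION_EFFECT)
print("average estimated slope: ", round(average_slope, 3))
```

Under these assumptions the average estimated slope settles well above 3, so the estimator is biased: part of the effect of ability is attributed to education.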
23 The Importance of S1 through S5 If assumptions S1 through S5 hold, then the OLS estimates are unbiased, or in equation form \(E(\hat{\beta}_0) = \beta_0\) and \(E(\hat{\beta}_1) = \beta_1\). Note that in simple linear regression analysis for non-experimental data (i.e. the type of data economists use), these assumptions almost always fail and therefore the OLS estimates are typically biased.
24 Assumption S6: The Error Term has Constant Variance This assumption states that the error term has a constant variance, or in equation form \(Var(\varepsilon \mid x) = \sigma^2\). This is called homoskedasticity. If this assumption fails, then the error term is heteroskedastic, that is, the error term has a non-constant variance.
25 How to Determine if Assumption S6 is Violated? Create a scatter plot of y against x and decide whether the points are scattered in a constant manner around the line. Heteroskedasticity does not have to look like the graph on the right on the next slide; there just has to be a non-constant distribution of the data points along the line. Chapter 9 gives more in-depth coverage of this topic.
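A rough numerical companion to the scatter plot is to compare the spread of the residuals at low and high values of x. The sketch below does this on simulated data in which the error standard deviation is assumed to grow with x (0.5 times x). The data-generating values are hypothetical, and the comparison is an informal check rather than a formal test.

```python
import random
import statistics


def fit_line(x, y):
    """Return the OLS (intercept, slope) from regressing y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(3)
x = sorted(rng.uniform(1, 10) for _ in range(200))
# Heteroskedastic errors by construction: the standard deviation rises with x.
y = [1 + 2 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

half = len(x) // 2
var_low = statistics.variance(residuals[:half])    # residual spread at low x
var_high = statistics.variance(residuals[half:])   # residual spread at high x
print("residual variance, low x: ", round(var_low, 2))
print("residual variance, high x:", round(var_high, 2))
```

A large gap between the two variances is the numerical counterpart of points fanning out around the line in the scatter plot.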
26 Visual Depiction of Homoskedasticity versus Heteroskedasticity
27 The Importance of S1 through S6 If assumptions S1 through S6 hold, then the OLS estimates are BLUE, or the Best Linear Unbiased Estimators. In this instance Best means minimum variance. This means that among all linear unbiased estimators of the population slope and population intercept, the OLS estimates have the lowest variance. As before, in simple linear regression analysis in economics these assumptions rarely hold.
28 Understand How to Conduct Hypothesis Tests in Linear Regression Analysis There are three different methods used to test hypotheses in this chapter: Confidence Interval, Critical Value, and P-value. All three of these methods yield the same conclusion.
29 Conduct Hypothesis Tests for the Overall Significance of the Regression Function: the F-test A hypothesis test of the overall significance of the regression model tests whether any of the explanatory variables have a statistically significant effect on the dependent variable. In simple linear regression there is only one explanatory variable, so the hypotheses are \(H_0: \beta_1 = 0\) (x does not affect y) and \(H_1: \beta_1 \neq 0\) (x affects y).
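30 The F-test Continued The test statistic for this hypothesis is \(F = \frac{SS_{Explained}/k}{SS_{Unexplained}/(n-k-1)} = \frac{MS_{Explained}}{MS_{Unexplained}}\), where, remember, k = # of explanatory variables; in this case k = 1.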
31 The F-test Continued The p-value for this hypothesis is computed from the F distribution and is found in the ANOVA table under the column titled Significance F. Rejection rule using the critical value: Reject H0 if the F-statistic > \(F_{\alpha,\,k,\,n-k-1}\). Rejection rule using the p-value: Reject H0 if the p-value < α.
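32 F-test: Excel Output Critical value: Reject H0 if the F-statistic is > 5.32, the 5% critical value of the F distribution with 1 and 8 degrees of freedom. Because the F-statistic reported in the output is > 5.32, we reject H0 and conclude that education affects salary at the 5% level. p-value: Reject H0 if the p-value < .05. Because the reported p-value is < .05, we reject H0 and conclude that education affects salary at the 5% level. [Output callouts: MSExplained, MSUnexplained, F-Statistic, p-value.]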
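The F-statistic can be recomputed by hand from the ten observations in the Salary vs. Education example. This is a sketch under one loud assumption: the education values for the second, fifth, and sixth observations are taken to be 12, 16, and 16, chosen because they reproduce the slope interval and the $70,057.58 prediction reported in the slides. The critical value 5.32 is the tabulated 5% value of the F distribution with 1 and 8 degrees of freedom.

```python
# Salary vs. Education data. ASSUMPTION: the education values for observations
# 2, 5, and 6 (12, 16, 16) are inferred, not read directly from the slides.
education = [12, 12, 14, 16, 16, 16, 17, 18, 19, 20]
salary = [33000, 22000, 38000, 45000, 40000, 50000, 60000, 43000, 135000, 122000]

n = len(education)
k = 1  # number of explanatory variables

x_bar = sum(education) / n
y_bar = sum(salary) / n
sxx = sum((x - x_bar) ** 2 for x in education)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(education, salary))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in education]

ss_explained = sum((f - y_bar) ** 2 for f in fitted)
ss_unexplained = sum((y - f) ** 2 for y, f in zip(salary, fitted))

ms_explained = ss_explained / k
ms_unexplained = ss_unexplained / (n - k - 1)
f_statistic = ms_explained / ms_unexplained

F_CRITICAL = 5.32  # tabulated F(0.05; 1, 8)
reject_h0 = f_statistic > F_CRITICAL

print("F-statistic:", round(f_statistic, 2))
print("reject H0 at the 5% level:", reject_h0)
```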
33 Conduct Hypothesis Tests for the Individual Significance of the Slope Coefficient: the t-test A hypothesis test of individual significance tests whether one explanatory variable has a statistically significant effect on the dependent variable. In simple linear regression there is only one explanatory variable, so the hypotheses are \(H_0: \beta_1 = 0\) (x does not affect y) and \(H_1: \beta_1 \neq 0\) (x affects y).
34 Standard Error of the Slope Coefficient To test individual significance of the slope coefficient we need to know the standard error of the slope, or \(s_{\hat{\beta}_1} = \frac{\sqrt{SS_{Unexplained}/(n-2)}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\). The standard error of the slope decreases if: the SSUnexplained goes down; the number of observations goes up; the variance of x increases (the square root of the numerator of the variance formula is in the denominator).
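The three effects listed on this slide can be checked numerically. The sketch below assumes the usual simple-regression formula, the square root of SSUnexplained/(n − 2) divided by the square root of the sum of squared deviations of x. The x values and SSUnexplained figures are made up purely for illustration.

```python
import math


def slope_standard_error(ss_unexplained, x):
    """sqrt(SSUnexplained / (n - 2)) / sqrt(sum of squared deviations of x)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return math.sqrt(ss_unexplained / (n - 2)) / math.sqrt(sxx)


x_base = list(range(1, 11))                       # hypothetical x values, n = 10
baseline = slope_standard_error(100.0, x_base)

# 1. SSUnexplained goes down (same x values).
lower_ssu = slope_standard_error(50.0, x_base)

# 2. Number of observations goes up: the same x values observed twice, with
#    SSUnexplained doubled so the scatter per observation is unchanged.
more_obs = slope_standard_error(200.0, x_base * 2)

# 3. Variance of x increases: same n and SSUnexplained, x spread twice as wide.
wider_x = slope_standard_error(100.0, [2 * xi for xi in x_base])

print("baseline:           ", round(baseline, 4))
print("lower SSUnexplained:", round(lower_ssu, 4))
print("more observations:  ", round(more_obs, 4))
print("more variation in x:", round(wider_x, 4))
```

Each of the three variations should print a smaller standard error than the baseline.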
35 Fitting a Better Line When There is More Variation in the Independent Variable, x
36 Where to Find the Standard Error of the Slope in Regression Output [Figure: regression output with the standard error of the slope coefficient highlighted.]
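37 Using a Confidence Interval to Test for the Individual Significance of the Slope Coefficient The hypothesis is \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\). The formula for the confidence interval is \(\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\, s_{\hat{\beta}_1}\), where \(s_{\hat{\beta}_1}\) is the standard error of the slope. The rejection rule is: Reject H0 if 0 is not within the confidence interval.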
38 Confidence Interval for Individual Significance of the Slope Coefficient ($4,320.86, $18,194.3) Reject H0 because 0 is not within this interval, and conclude that education affects salary.
39 Using the p-value to Test for the Individual Significance of the Slope Coefficient The hypothesis is \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\). The rejection rule is: Reject H0 if the p-value < α.
40 Using the p-value to Test for the Individual Significance of the Slope Coefficient The hypothesis is \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\). The rejection rule is: Reject H0 if the p-value < 0.05. Decision: Because the p-value reported in the output is < 0.05, we reject H0 and conclude that education affects salary.
41 Using Excel to Test for the Individual Significance of the Slope Coefficient The Excel output reports the p-value, the 95% confidence interval ($4,332.85, $18,182.30), and the t-statistic (3.75).
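The t-statistic and confidence interval shown in the output can be reproduced, at least approximately, outside Excel. The sketch below assumes that the education values for the second, fifth, and sixth observations of the Salary vs. Education example are 12, 16, and 16. They are inferred because they reproduce the reported figures, not read from the slides. The critical value 2.306 is the tabulated two-sided 5% t value with 8 degrees of freedom.

```python
import math

# Salary vs. Education data. ASSUMPTION: the education values for observations
# 2, 5, and 6 (12, 16, 16) are inferred, not read directly from the slides.
education = [12, 12, 14, 16, 16, 16, 17, 18, 19, 20]
salary = [33000, 22000, 38000, 45000, 40000, 50000, 60000, 43000, 135000, 122000]

n = len(education)
x_bar = sum(education) / n
y_bar = sum(salary) / n
sxx = sum((x - x_bar) ** 2 for x in education)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(education, salary))

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Standard error of the slope: sqrt(SSUnexplained / (n - 2)) / sqrt(Sxx).
ss_unexplained = sum((y - (intercept + slope * x)) ** 2
                     for x, y in zip(education, salary))
se_slope = math.sqrt(ss_unexplained / (n - 2)) / math.sqrt(sxx)

t_statistic = slope / se_slope          # tests H0: slope = 0

T_CRITICAL = 2.306                      # tabulated t(0.025; 8)
ci_low = slope - T_CRITICAL * se_slope
ci_high = slope + T_CRITICAL * se_slope

print(f"slope = {slope:,.2f}, standard error = {se_slope:,.2f}")
print(f"t-statistic = {t_statistic:.2f}")
print(f"95% confidence interval = (${ci_low:,.2f}, ${ci_high:,.2f})")
print("reject H0:", not (ci_low <= 0 <= ci_high))
```

Because the critical value is rounded to three decimals, the interval endpoints may differ from the Excel output by a few cents.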
42 Construct Confidence Intervals around the Predicted Value of y The formula for the confidence interval is \(\hat{y}_p \pm t_{\alpha/2,\,n-2}\, s_{\hat{y}_p}\), where \(\hat{y}_p\) is the predicted value, \(t_{\alpha/2,\,n-2}\) is the critical value from the t-table, and \(s_{\hat{y}_p}\) is the standard error of the prediction. The only component that we don’t know how to obtain is the standard error of the prediction.
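43 The Standard Error for a Prediction about the Mean Value of y The standard error for the mean of y given a particular \(x_p\) is \(s_{\hat{y}_p} = s_{y|x}\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\), where \(s_{y|x} = \sqrt{SS_{Unexplained}/(n-2)}\) is the standard error of the regression.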
44 Example: Salary vs. Education Estimated Regression Equation: \(\hat{y} = -121,321.21 + 11,257.58x\)
Education in years (x)    Salary in $ (y)
12                        33,000
12                        22,000
14                        38,000
16                        45,000
16                        40,000
16                        50,000
17                        60,000
18                        43,000
19                        135,000
20                        122,000
Predict the mean salary for people with 17 years of education.
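45 Example of Applying the Confidence Interval Formula for a Mean The predicted value when \(x_p\) is 17 is found by substituting 17 into the estimated regression equation, which gives $70,057.58. This says that a person with 17 years of education is predicted to earn $70,057.58. The critical value is \(t_{0.025,\,8} = 2.306\).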
46 Example of Applying the Confidence Interval Formula for a Mean
47 Example of Applying the Confidence Interval Formula for the Mean Applying the formula with \(x_p\) = 17 gives the 95% confidence interval around the predicted mean salary, that is, around the mean salary of people with 17 years of education.
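48 Confidence Interval for an Individual y, Given x The confidence interval estimate for an individual value of y given a particular \(x_p\) is \(\hat{y}_p \pm t_{\alpha/2,\,n-2}\, s_{y|x}\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\), where \(s_{y|x} = \sqrt{SS_{Unexplained}/(n-2)}\) is the standard error of the regression. The extra term, the 1 under the square root, adds to the interval width to reflect the added uncertainty for an individual case.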
49 Example of Applying the Confidence Interval Formula for an Individual Applying the formula with \(x_p\) = 17 gives the 95% confidence interval around the predicted salary for an individual. This interval is much wider than the interval for the mean value.
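Both intervals at 17 years of education can be worked through in a few lines. This is a sketch under stated assumptions. The education values for the second, fifth, and sixth observations (12, 16, 16) are inferred because they reproduce the $70,057.58 prediction, and the critical value 2.306 is the tabulated two-sided 5% t value with 8 degrees of freedom. The interval endpoints it prints are computed here and do not come from the slides.

```python
import math

# Salary vs. Education data. ASSUMPTION: the education values for observations
# 2, 5, and 6 (12, 16, 16) are inferred, not read directly from the slides.
education = [12, 12, 14, 16, 16, 16, 17, 18, 19, 20]
salary = [33000, 22000, 38000, 45000, 40000, 50000, 60000, 43000, 135000, 122000]

n = len(education)
x_bar = sum(education) / n
y_bar = sum(salary) / n
sxx = sum((x - x_bar) ** 2 for x in education)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(education, salary))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

ss_unexplained = sum((y - (intercept + slope * x)) ** 2
                     for x, y in zip(education, salary))
s_regression = math.sqrt(ss_unexplained / (n - 2))   # standard error of the regression

x_p = 17
predicted = intercept + slope * x_p
T_CRITICAL = 2.306                                    # tabulated t(0.025; 8)

# Standard error for the mean of y given x_p.
leverage = 1 / n + (x_p - x_bar) ** 2 / sxx
se_mean = s_regression * math.sqrt(leverage)
# Standard error for an individual y given x_p: note the extra 1.
se_individual = s_regression * math.sqrt(1 + leverage)

mean_interval = (predicted - T_CRITICAL * se_mean,
                 predicted + T_CRITICAL * se_mean)
individual_interval = (predicted - T_CRITICAL * se_individual,
                       predicted + T_CRITICAL * se_individual)

print(f"predicted salary at x_p = 17: ${predicted:,.2f}")
print("95% interval for the mean:      ", [round(v) for v in mean_interval])
print("95% interval for an individual: ", [round(v) for v in individual_interval])
```

The only difference between the two calculations is the extra 1 under the square root, which is what makes the interval for an individual so much wider.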
50 Interval Estimates for Different Values of x [Figure: the prediction interval for an individual y, given \(x_p\), and the confidence interval for the mean of y, given \(x_p\), plotted around the line y = β0 + β1x; horizontal axis x, vertical axis y.]