# Hypothesis Testing in Linear Regression Analysis


Chapter 5: Hypothesis Testing in Linear Regression Analysis

Learning Objectives
- Construct sampling distributions
- Understand desirable properties of estimators
- Understand the simple linear regression assumptions required for OLS to be the best linear unbiased estimator
- Understand how to conduct hypothesis tests in linear regression analysis
- Conduct hypothesis tests for the overall statistical significance of the estimated sample regression function
- Conduct hypothesis tests for the statistical significance of the estimated slope coefficient
- Understand how to read regression output in Excel
- Construct confidence intervals around the predicted value of y

Understand the Goals of Hypothesis Testing
Goal: To gain insight into specific parameters that exist for a population.
Problem: Because the population is unobserved, the specific population parameters are also unobserved.
Approach: Draw a random sample from the population and observe the sample statistics for that sample. Use those observed sample statistics as the best guess of the unobserved population parameters. Then ask whether the sample statistics actually observed would be likely if the unobserved population parameter equaled a hypothesized value.

Construct Sampling Distributions
A sampling distribution is the distribution of a sample statistic based on random sampling. To construct a sampling distribution for the estimated slope coefficient, we would draw many, many samples from the population, estimate the slope each time, and then construct a relative frequency histogram of those estimates.

Relative Frequency Histogram of 100 Different Slope Coefficients
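A histogram like this can be reproduced with a short simulation: repeatedly draw samples from a known population model and estimate the slope each time. The sketch below assumes an illustrative population ($\beta_0 = 2$, $\beta_1 = 3$, standard normal errors) that is not from the text.

```python
# Sketch: sampling distribution of the OLS slope from 100 random samples.
# Population values (beta0=2, beta1=3, noise sd=1) are illustrative assumptions.
import random

def ols_slope(xs, ys):
    """OLS slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

random.seed(42)
slopes = []
for _ in range(100):                      # 100 samples, as in the histogram
    xs = [random.uniform(0, 10) for _ in range(30)]
    ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]
    slopes.append(ols_slope(xs, ys))

mean_slope = sum(slopes) / len(slopes)    # close to the true slope of 3
```

Plotting `slopes` as a relative frequency histogram gives a picture like the one on this slide, centered near the true population slope.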

Understand Desirable Properties of Simple Linear Regression Estimators
Two desirable properties of estimators are:
- Unbiased – the average value of all possible estimates equals the true population value.
- Efficient – if two estimators are unbiased, the one with the lower variance is more efficient.

Visual Depiction of the Sampling Distribution of an Unbiased Estimator
$E(\hat{\beta}_1)$ is the average value of all possible values of $\hat{\beta}_1$ from a sample of size $n$ drawn from the population.

What does the Term Unbiased Mean?
An estimator is unbiased if, on average, it equals the true population parameter. Unbiasedness does not mean that for one particular sample the estimated value equals the true population parameter. An estimator that is not unbiased is biased.

Visual Depiction of a Biased Estimator

Visual Depiction of Efficiency

What does the Term Efficient Mean?
If both estimators are unbiased, then one estimator is more efficient than another if it has a lower variance. A more efficient estimator is preferred because it does a better job of estimating the true population parameter (extreme values are less likely).

Understand the Simple Linear Regression Assumptions Required for OLS to be the Best Linear Unbiased Estimator
Assumptions required for OLS to be unbiased:
- Assumption S1: The model is linear in the parameters.
- Assumption S2: The data are collected through independent, random sampling.
- Assumption S3: There must be sample variation in the independent variable.
- Assumption S4: The error term has zero mean.
- Assumption S5: The error term is uncorrelated with the independent variable and all functions of the independent variable.
Additional assumption required for OLS to be BLUE:
- Assumption S6: The error term has constant variance.
Note that these assumptions are theoretical and typically cannot be proven or disproven.

Assumption S1: Linear in the Parameters
This assumption states that for OLS to be unbiased, the population model must be correctly specified as linear in the parameters.

When is Assumption S1 Violated?
Assumption S1 is violated if the population regression model is non-linear in the parameters, or if the true population model is not specified correctly, i.e., if the true model differs from the one that is actually estimated.

Assumption S2: The Data are Collected through Simple Random Sampling
This assumption states that for OLS to be unbiased, the data must be obtained through simple random sampling. This ensures that the observations are statistically independent of each other across the units of observation.

When is Assumption S2 Violated?
- If the data are time series data, such as GDP and interest rates for the US collected over time. In this circumstance, observations from one time period are likely related to observations from previous periods.
- If there is some type of selection bias in the sampling, for example, if individuals opt in to a job training program or college, or if the response rate for a survey is low.

Assumption S3: There Must be Sample Variation in the Independent Variables
This assumption states that for OLS to be unbiased, the independent variable cannot take the same value for every observation, i.e. $\sum_i (x_i - \bar{x})^2 > 0$. This assumption ensures that the slope estimator

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

is defined. If there is no sample variation in the independent variable, the denominator is zero and $\hat{\beta}_1$ is undefined. This assumption is almost never violated.
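A minimal sketch of why S3 matters, using made-up values: when every x is identical, the slope formula's denominator is zero and the division fails.

```python
# If every x is identical, the slope denominator sum((x - xbar)^2) is zero
# and the slope estimator is undefined. Values are illustrative only.
xs = [5, 5, 5, 5]
xbar = sum(xs) / len(xs)
denominator = sum((x - xbar) ** 2 for x in xs)   # 0 when there is no variation

try:
    slope_term = 1.0 / denominator   # stands in for the slope formula's division
    defined = True
except ZeroDivisionError:
    defined = False                  # S3 violated: the slope cannot be computed
```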

A Visual Depiction of S3 being Violated

Assumption S4: The Error Term has Zero Mean
This assumption states that for OLS to be unbiased, the average value of the population error term is zero, i.e. $E(\varepsilon) = 0$. This assumption will hold as long as an intercept is included in the model, because if the average value of the error term were some value other than zero, the intercept would simply absorb it.

Assumption S5: The Error Term is Not Correlated with the Independent Variable or Any Function of the Independent Variable
This assumption states that for OLS to be unbiased, the error term is uncorrelated with the independent variable and all functions of the independent variable, i.e. $E(\varepsilon \mid x) = 0$. This is read as "the expected value of ε given x is equal to zero."

How to Determine if Assumption S5 Violated?
Think of all the factors that affect the dependent variable but are not specified in the model. For the salary vs. education example, variables in the error term include experience, ability, job type, gender, and many other factors. If any of these factors, say ability, is related to the independent variable, education, then S5 is violated. Note that the error term is never observed, so determining whether S5 is violated is only a thought experiment.

The Importance of S1 through S5
If assumptions S1 through S5 hold, then the OLS estimates are unbiased, or in equation form, $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$. Note that in simple linear regression analysis with non-experimental data (i.e. the type of data economists use), these assumptions almost always fail, and therefore the OLS estimates are typically biased.

Assumption S6: The Error Term has Constant Variance
This assumption states that the error term has a constant variance, or in equation form, $\operatorname{Var}(\varepsilon \mid x) = \sigma^2$. This is called homoskedasticity. If this assumption fails, then the error term is heteroskedastic, i.e. it has a non-constant variance.

How to Determine if Assumption S6 Violated?
Create a scatter plot of y against x and decide whether the points are scattered in a constant manner around the line. Heteroskedasticity does not have to look like the graph on the next slide; there just has to be a non-constant distribution of the data points along the line. Chapter 9 gives more in-depth coverage of this topic.
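The scatter-plot check can be mimicked numerically. The sketch below generates one error series with constant variance and one whose variance grows with x (all numbers are illustrative assumptions), then compares the spread in the low-x and high-x halves; a ratio far from 1 is the pattern the scatter plot would reveal.

```python
# Sketch: homoskedastic errors (constant sd) vs heteroskedastic errors
# (sd grows with x). Parameters are illustrative assumptions.
import random
import statistics

random.seed(0)
xs = [x / 10 for x in range(1, 201)]               # x from 0.1 to 20.0
homo = [random.gauss(0, 1) for _ in xs]            # constant variance
hetero = [random.gauss(0, 0.2 * x) for x in xs]    # variance rises with x

half = len(xs) // 2
homo_ratio = statistics.stdev(homo[half:]) / statistics.stdev(homo[:half])
hetero_ratio = statistics.stdev(hetero[half:]) / statistics.stdev(hetero[:half])
# homo_ratio is near 1; hetero_ratio is well above 1
```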

Visual Depiction of Homoskedasticity versus Heteroskedasticity

The Importance of S1 through S6
If assumptions S1 through S6 hold, then the OLS estimates are BLUE, the Best Linear Unbiased Estimators. In this context, "best" means minimum variance: among all linear unbiased estimators of the population slope and intercept, the OLS estimates have the lowest variance. As before, in simple linear regression analysis in economics these assumptions rarely hold.

Understand How to Conduct Hypothesis Tests in Linear Regression Analysis
There are three different methods used to test hypotheses in this chapter:
- Confidence interval
- Critical value
- p-value
All three of these methods yield the same conclusion.

Conduct Hypothesis Tests for the Overall Significance of the Regression Function: the F-test
A hypothesis test of the overall significance of the regression model asks whether any of the explanatory variables have a statistically significant effect on the dependent variable. In simple linear regression there is only one explanatory variable, so the hypotheses are
$H_0\colon \beta_1 = 0$ (x does not affect y)
$H_1\colon \beta_1 \neq 0$ (x affects y)

The F-test Continued
The test statistic for this hypothesis is

$$F = \frac{SS_{Explained}/k}{SS_{Unexplained}/(n-k-1)} = \frac{MS_{Explained}}{MS_{Unexplained}}$$

where k = the number of explanatory variables; in this case k = 1.

The F-test Continued
The p-value for this hypothesis is computed from the F distribution and is found in the ANOVA table under the column titled Significance F.
Rejection rule using the critical value: Reject H0 if the F-statistic > $F_{\alpha,\,k,\,n-k-1}$
Rejection rule using the p-value: Reject H0 if the p-value < α

F-test: Excel Output
Critical value: Reject H0 if the F-statistic > 2.31. Because the F-statistic exceeds 2.31, we reject H0 and conclude that education affects salary at the 5% level.
p-value: Reject H0 if the p-value < .05. Because the p-value is less than .05, we reject H0 and conclude that education affects salary at the 5% level.
(The Excel ANOVA table reports MSExplained, MSUnexplained, the F-statistic, and the p-value.)
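The ANOVA arithmetic behind the Excel output can be sketched directly: fit the line, split total variation into explained and unexplained sums of squares, and form the F-statistic as a ratio of mean squares. The toy data here are illustrative assumptions, not the textbook's salary data.

```python
# Sketch of the F-test arithmetic for simple linear regression.
# Toy data are illustrative assumptions.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n, k = len(xs), 1                        # k = number of explanatory variables
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * x for x in xs]

ss_explained = sum((f - ybar) ** 2 for f in fitted)
ss_unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))
f_stat = (ss_explained / k) / (ss_unexplained / (n - k - 1))
# Reject H0 at level alpha if f_stat exceeds the F critical value
# with k and n - k - 1 degrees of freedom (from an F table or software).
```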

Conduct Hypothesis Tests for the Individual Significance of the Slope Coefficient: the t-test
A hypothesis test of the individual significance of a regression coefficient asks whether one explanatory variable has a statistically significant effect on the dependent variable. In simple linear regression there is only one explanatory variable, so the hypotheses are
$H_0\colon \beta_1 = 0$ (x does not affect y)
$H_1\colon \beta_1 \neq 0$ (x affects y)

Standard Error of the Slope Coefficient
To test the individual significance of the slope coefficient we need the standard error of the slope,

$$se(\hat{\beta}_1) = \sqrt{\frac{SS_{Unexplained}/(n-2)}{\sum_i (x_i - \bar{x})^2}}$$

The standard error of the slope decreases if:
- SSUnexplained goes down
- the number of observations goes up
- the variance of x increases (the sum of squared deviations of x appears in the denominator)
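The standard-error computation can be sketched as follows, using hypothetical toy data (an illustrative assumption, not the salary example):

```python
# Sketch: se(b1) = sqrt( (SS_Unexplained / (n - 2)) / sum((x - xbar)^2) ).
# Toy data are illustrative assumptions.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
ss_unexplained = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

se_b1 = math.sqrt((ss_unexplained / (n - 2)) / sxx)
```

Note how each driver listed above enters the formula: a smaller SSUnexplained shrinks the numerator, while more observations or more spread in x grows `sxx` in the denominator.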

Fitting a Better Line When There is More Variation in the Independent Variable, x

Where to Find the Standard Error of the Slope in Regression Output
Standard Error of the Slope Coefficient

Using a Confidence Interval to Test for the Individual Significance of the Slope Coefficient
The hypothesis is $H_0\colon \beta_1 = 0$ versus $H_1\colon \beta_1 \neq 0$.
The formula for the confidence interval is

$$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1)$$

The rejection rule is: Reject H0 if 0 is not within the confidence interval.

Confidence Interval for Individual Significance of the Slope Coefficient
(\$4,320.86, \$18,194.30). Reject H0 because 0 is not within this interval, and conclude that education affects salary.
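The confidence-interval method can be sketched in a few lines. The slope estimate, standard error, and critical value below are illustrative assumptions (2.306 is the two-sided 5% t critical value with 8 degrees of freedom), not the salary regression's numbers.

```python
# Sketch of the confidence-interval test: b1 +/- t_crit * se(b1);
# reject H0 if zero lies outside the interval. Values are hypothetical.
b1 = 10.0              # hypothetical slope estimate
se_b1 = 2.0            # hypothetical standard error
t_crit = 2.306         # t_{0.025, 8}, two-sided 5% critical value for 8 df

lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
reject_h0 = not (lower <= 0 <= upper)   # 0 outside the interval -> reject
```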

Using the p-value to Test for the Individual Significance of the Slope Coefficient
The hypothesis is $H_0\colon \beta_1 = 0$ versus $H_1\colon \beta_1 \neq 0$.
The rejection rule is: Reject H0 if the p-value < α.

Using the p-value to Test for the Individual Significance of the Slope Coefficient
The hypothesis is $H_0\colon \beta_1 = 0$ versus $H_1\colon \beta_1 \neq 0$.
The rejection rule is: Reject H0 if the p-value < 0.05.
Decision: Because the p-value is less than 0.05, we reject H0 and conclude that education affects salary.

Using Excel to test for the Individual Significance of the Slope Coefficient
t-statistic = 3.75
p-value: reported next to the coefficient in the Excel output
95% confidence interval: (\$4,332.85, \$18,182.30)
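The critical-value version of the t-test is simple arithmetic: divide the slope by its standard error and compare to the critical value. The numbers below are hypothetical assumptions (the slide's salary example reports a t-statistic of 3.75).

```python
# Sketch of the t-test for the slope: t = b1 / se(b1);
# reject H0 when |t| exceeds the critical value. Values are hypothetical.
b1 = 10.0
se_b1 = 2.0
t_crit = 2.306        # two-sided 5% critical value for a small-sample example

t_stat = b1 / se_b1
reject_h0 = abs(t_stat) > t_crit
```

All three methods (confidence interval, critical value, p-value) use these same ingredients and, as the chapter notes, always agree.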

Construct Confidence Intervals around the Predicted Value of y
The formula for the confidence interval is

$$\hat{y} \pm t_{\alpha/2,\,n-2}\, se(\hat{y})$$

where $\hat{y}$ is the predicted value, $t_{\alpha/2,\,n-2}$ is the critical value from the t-table, and $se(\hat{y})$ is the standard error of the prediction. The only component that we don't yet know how to obtain is the standard error of the prediction.

The Standard Error for a Prediction about the Mean Value of y
The standard error for the mean of y given a particular $x_p$ is

$$se(\hat{y}) = s_e \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$$

where $s_e$ is the standard error of the regression.
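A sketch of this standard error, assuming hypothetical toy data (not the salary example):

```python
# Sketch: se for the mean of y at x_p is
# s_e * sqrt(1/n + (x_p - xbar)^2 / sum((x - xbar)^2)).
# Toy data are illustrative assumptions.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
x_p = 4.0

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
s_e = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

se_mean = s_e * math.sqrt(1 / n + (x_p - xbar) ** 2 / sxx)
```

The $(x_p - \bar{x})^2$ term makes the interval widen as $x_p$ moves away from the mean of the observed x values.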

Example: Salary vs. Education
Estimated Regression Equation:

| Education in years (x) | Salary in \$ (y) |
| --- | --- |
| 12 | 33,000 |
|  | 22,000 |
| 14 | 38,000 |
| 16 | 45,000 |
|  | 40,000 |
|  | 50,000 |
| 17 | 60,000 |
| 18 | 43,000 |
| 19 | 135,000 |
| 20 | 122,000 |

Predict the mean salary for people with 17 years of education.

Example of Applying the Confidence Interval Formula for a Mean
The predicted value when $x_p = 17$ is \$70,057.58; that is, a person with 17 years of education is predicted to earn \$70,057.58 on average. The critical value is $t_{0.025,\,8} = 2.306$ (n = 10, so n − 2 = 8 degrees of freedom).

Example of Applying the Confidence Interval Formula for a Mean

Example of Applying the Confidence Interval Formula for the Mean
The 95% confidence interval when $x_p$ is 17 is the predicted value plus or minus the critical value times the standard error. This is a 95% confidence interval around the predicted salary for the mean.

Confidence Interval for an Individual y, Given x
The confidence interval for an individual value of y given a particular $x_p$ is

$$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$$

The extra term (the 1 under the square root) widens the interval to reflect the added uncertainty for an individual case.
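The widening effect of the extra term can be seen directly by computing both standard errors on the same hypothetical toy data (an illustrative assumption):

```python
# Sketch: the individual-prediction standard error adds a "1" under the
# square root, so it is always larger than the mean-prediction standard error.
# Toy data are illustrative assumptions.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
x_p = 4.0

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
s_e = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

se_mean = s_e * math.sqrt(1 / n + (x_p - xbar) ** 2 / sxx)
se_individual = s_e * math.sqrt(1 + 1 / n + (x_p - xbar) ** 2 / sxx)  # wider
```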

Example of Applying the Confidence Interval Formula for an Individual
The 95% confidence interval when $x_p$ is 17 is computed the same way, using the individual-case standard error. This is a 95% confidence interval around the predicted salary for an individual, and it is much wider than the interval for the mean value.

Interval Estimates for Different Values of x
Figure: around the population line $y = \beta_0 + \beta_1 x$, the prediction interval for an individual y given $x_p$ lies outside the narrower confidence interval for the mean of y given $x_p$; both widen as $x_p$ moves away from $\bar{x}$.