Presentation on theme: "Inference for Regression Today we will talk about the conditions necessary to make valid inference with regression We will also discuss the various types."— Presentation transcript:
Inference for Regression Today we will talk about the conditions necessary to make valid inference with regression We will also discuss the various types of inference that can be made with regression All this will be discussed in the context of simple linear regression
Conditions for Valid Inference If these conditions hold Independence Normal errors Errors have mean zero Errors have equal SD Then we can Find CI’s for the slope and intercept Find a CI for the mean of y at a given x Find a prediction interval (PI) for an individual y at a given x
Checking the Conditions After performing regression, we can check the conditions for valid inference. Independence – good sampling and experimental techniques give independence. Mean zero errors – verify using a residual plot. Errors with constant standard deviation (homoscedasticity) – verify with residual plot. Normal errors – verify with a normal quantile plot of the residuals.
Residual Plot Residuals are sample estimates of the errors (in y) made by the line. Residual = observed y – predicted y Predicted y’s are determined by the equation of the least squares line. A residual plot is a scatter plot of the residuals vs. x.
Examining Residual Plots A residual plot indicates that the conditions of homoscedastic mean zero errors if the points appear randomly scattered with a constant spread in the y direction. There should not be an obvious pattern to the residuals, nor should they “fan out”. Fanning out indicates nonconstant variability, called heteroscedasticity.
Residual Normal Quantile Plot A normal quantile plot of the residuals should be examined to check for non- normal errors. Plots that indicate skewness or heavy tails show that regression inference using the normal distribution or the t-distribution will not be valid.
What about independence? Independence means that the (x,y) measurements do not depend on one another A simple random sample will achieve independence Common cases without independence: same individual measured twice or including related persons. You are responsible for insuring independence by planning your study well. If achieving independence is impossible, a professional statistician can help you perform valid inference using more advanced techniques.
Examples of Unmet Conditions Trying to fit a straight line to yield data - obvious mistake because it shows curvature Shows up on residual plot as curvature - NOT CENTERED AT ZERO
Examples of Unmet Conditions This data set has unequal spread - thus the errors have unequal standard deviation. Shows up on residual plot as a “horn” or “fanning out” pattern. If residuals are not centered at zero or exhibit curvature - STOP - normality is inconsequential, seek help of a professional statistician.
Back to Hand-Span Example We fit a line to the data Check conditions –Independence - used different people in simple random sample - OK –errors have mean zero - checked residual plot - OK –errors have equal SD - checked residual plot - OK –errors have light tails -OK Since the conditions hold we can: –Form CI’s for population slope and intercept –Form CI’s for mean span of all individuals of a given height –Form prediction interval of hand-span for an individual of a given height
Complete Regression Output Source | SS df MS Number of obs = F( 1, 10) = Model | Prob > F = Residual | R-squared = Adj R-squared = Total | Root MSE = Spancm | Coef. Std. Err. t P>|t| [95% Conf. Interval] Heightin | _cons | Top right: Adjusted R 2 =.7685, RMSE = Bottom part - Heightin row corresponds to slope, _cons row to the intercept Coef column tells us Span (cm) = Height (in) t, P>|t| tests H 0 : 1 (slope) = 0 and H 0 : 0 (intercept) = 0, gives t & p-value 95% Conf Interval gives 95% confidence interval for slope & intercept 95% confident that population slope is in (.31,.71) 95% confident that population intercept is in (-27.63, -.38)
Inference on the Slope Why test H 0 : 1 (slope) = 0? If population slope is zero, then the line is useless - knowing x gives me no additional information about y Example: If Span (cm) = Height, then knowing height tells me nothing about span, all I know is the population mean of hand spans is approximately.51 cm
Inference on the Intercept Why test H 0 : 0 (intercept) = 0? This depends on the situation. Note that the estimate for y when x=0 is the intercept. Y = 0 + 1 (x=0) = 0 Example: Hand Span for person with height 0 is cm What? Regression is valid only within the range of observed data. We didn’t observe anyone with height 0. Line fits for observed region - may or may not for other regions.
Confidence Interval for the Mean At each value of x, we have a population of values for y. We would like to form a confidence interval for the mean of y for this population Example: Estimate the mean hand-span of persons of height 70.5 inches Center of interval will come from line Span = (70.5) = cm Endpoints of interval can be calculated, however, we will use a plot to estimate them
Prediction Interval for an Individual Suppose I have an individual of height 70.5 inches. Can I form an interval that I can be 95% confident that I will include the hand span of this individual? This is called a prediction interval By necessity, it is wider than a CI for the mean - predicting an individual is more challenging Center at line, endpoints estimated with plot
Confidence and Prediction Intervals CI for mean hand-span at height=70.5 looks to be approx (21,23) Widens as we approach edge of data - less certain at edges PI for hand-span of an individual of height 70.5 inches looks to be approx (19,25) PI wider than CI PI includes most or all data
Crab Meat Example Easy to get total weight - really want weight of meat Want to predict meat weight based on total weight Scatterplot (total, meat) shows linearity Fit a straight line, it looks like an appropriate thing to do
Crab Meat Conditions Residual plot looks evenly spread and centered at zero Residuals are normal (or light tailed) Independence - took a simple random sample of crabs in the tank farm - never measured the same crab twice Conditions for inference are met
Crab Meat Output Source | SS df MS Number of obs = F( 1, 10) = Model | Prob > F = Residual | R-squared = Adj R-squared = Total | Root MSE = Meatmg | Coef. Std. Err. t P>|t| [95% Conf. Interval] Totalwtg | _cons | Adjusted R2 =.7285 (total weight explained 72% of variability in meat weight) RMSE = (standard deviation of errors) Equation: Meat weight (mg) = Total Weight (g) Reject null of zero slope, 95% CI for slope (8.88, 20.88) Fail to reject null of zero intercept (good!), 95% CI (-68,94)
Crab Meat CI’s and PI’s Confidence intervals for mean of meat weight given total weight Prediction intervals for individual crab’s meat weight given total weight For Total weight = 10, mean is in (100,200) and individual in (0,300)
Review and Preview We’re trying to find mathematical relationships between variables Start with scatterplots - if linear, correlation gives a good measure of the nature and strength of relationship Simple linear regression finds the equation of the least squares line Under certain conditions, an array of inferences can be performed
Review and Preview Next time, we’ll discuss other types of regression. Conditions for inference and nature of inferences will be the same.
Using StataQuest: Handspan Example Getting a Scatterplot Open the data set heightspan.dta Go to Graphs: Scatterplots: Plot Y vs. X Y=Spancm, X=Heightin Click OK Finding the correlation Go to Statistics: Correlation: Pearson (regular) Choose Spancm and Heightin, click OK r=.8767, p-val=.0002 for H 0 : =0
StataQuest: Handspan Example Performing Regression Go to Statistics: Simple Regression Dependent=Spancm Independent=Heightin Click OK. Checking Conditions We use these plots: Plot fitted model Plot residual vs. an X, choose Heightcm Normal Quantile plot of residuals For PI’s and CI’s choose Plot fitted model