Presentation on theme: "Lecture 17: Tues., March 16 Inference for simple linear regression (Ch. 7.3-7.4) R2 statistic (Ch. 8.6.2) Association is not causation (Ch. 7.5.3) Next."— Presentation transcript:
1 Lecture 17: Tues., March 16
Inference for simple linear regression (Ch. 7.3-7.4)
$R^2$ statistic (Ch. 8.6.2)
Association is not causation (Ch. 7.5.3)
Next class: Diagnostics for assumptions of simple linear regression model (Ch )
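2 Regression
Goal of regression: Estimate the mean response Y for subpopulations X=x, $\mu\{Y|X=x\}$.
Example: Y = catheter length required, X = height.
Simple linear regression model: $\mu\{Y|X\} = \beta_0 + \beta_1 X$.
Estimate $\beta_0$ and $\beta_1$ by least squares: choose $\hat\beta_0$ and $\hat\beta_1$ to minimize the sum of squared residuals (prediction errors), $\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$.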
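The least squares step can be sketched in a few lines of Python. This is a minimal illustration, not the JMP computation; the function name `least_squares` and the five data points are hypothetical and are not the catheter or car price data.

```python
def least_squares(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ssr = sum(r ** 2 for r in residuals)  # sum of squared residuals at the minimum
print(b0, b1, ssr)
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data.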
3 Car Price Example
A used-car dealer wants to understand how odometer reading affects the selling price of used cars.
The dealer randomly selects 100 three-year-old Ford Tauruses that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player and air conditioning.
carprices.JMP contains the price and number of miles on the odometer of each car.
5 Inference for Simple Linear Regression
Inference is based on the ideal simple linear regression model holding.
Inference is based on taking repeated random samples ($y_1, \dots, y_n$) from the same subpopulations ($x_1, \dots, x_n$) as in the observed data.
Types of inference:
Hypothesis tests for intercept and slope
Confidence intervals for intercept and slope
Confidence interval for mean of Y at X=X0
Prediction interval for future Y for which X=X0
6 Ideal Simple Linear Regression Model
Assumptions of the ideal simple linear regression model:
There is a normally distributed subpopulation of responses for each value of the explanatory variable.
The means of the subpopulations fall on a straight-line function of the explanatory variable.
The subpopulation standard deviations are all equal (to $\sigma$).
The selection of an observation from any of the subpopulations is independent of the selection of any other observation.
7 Sampling Distributions of $\hat\beta_0$ and $\hat\beta_1$
See handout. See Display 7.7.
Standard deviation is smaller for (i) larger n, (ii) smaller $\sigma$, (iii) larger spread in x (higher $s_X^2$).
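The effect of spread in x on the standard deviation of the slope estimate can be checked numerically. The sketch below assumes the standard formulas $SE(\hat\beta_1) = \hat\sigma / \sqrt{(n-1)s_X^2}$ and $SE(\hat\beta_0) = \hat\sigma \sqrt{1/n + \bar{X}^2 / ((n-1)s_X^2)}$, with $\hat\sigma$ estimated on n-2 degrees of freedom; the function name and the data are hypothetical.

```python
import math

def coefficient_standard_errors(x, y):
    """Return (se_intercept, se_slope) for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_X^2
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ssr / (n - 2))  # estimate of sigma on n - 2 d.f.
    se_slope = sigma_hat / math.sqrt(sxx)
    se_intercept = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / sxx)
    return se_intercept, se_slope

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
se_b0, se_b1 = coefficient_standard_errors(x, y)

# Same responses, but x values spread twice as far apart:
# the residuals are unchanged and the slope's standard error is halved.
x_wide = [2 * xi for xi in x]
_, se_b1_wide = coefficient_standard_errors(x_wide, y)
print(se_b0, se_b1, se_b1_wide)
```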
8 Hypothesis tests for $\beta_0$ and $\beta_1$
Hypothesis test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$.
Based on the t-test statistic, $t = \hat\beta_1 / SE(\hat\beta_1)$.
The p-value has the usual interpretation: the probability under the null hypothesis that |t| would be at least as large as its observed value. A small p-value is evidence against the null hypothesis.
Interpretation of null hypothesis: X is not a useful predictor of Y; the mean of Y is not associated with X.
The test for $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$ is based on an analogous test statistic.
Test statistics and p-values can be found on JMP output under Parameter Estimates, obtained by using Fit Line after Fit Y by X.
For car price data, there is convincing evidence that both intercept and slope are not zero (p-value < .0001 for both).
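As a rough illustration of what JMP reports under Parameter Estimates, the t-ratio for the slope and its two-sided p-value can be computed by hand. The sketch below uses only the Python standard library and obtains the p-value by numerically integrating the t density with Simpson's rule, which is an implementation choice made here, not the method JMP uses. All function names and the data are hypothetical.

```python
import math

def t_density(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|) via Simpson's rule on [0, |t|]; steps must be even."""
    b = abs(t)
    h = b / steps
    total = t_density(0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2 * total * h / 3  # P(-|t| < T < |t|)
    return 1 - central_area

def slope_t_test(x, y):
    """Return (t statistic, two-sided p-value) for H0: slope = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(ssr / (n - 2)) / math.sqrt(sxx)
    t = slope / se_slope
    return t, two_sided_p_value(t, n - 2)

# Hypothetical data (n = 5, so the test has 3 degrees of freedom).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_stat, p_value = slope_t_test(x, y)
print(t_stat, p_value)
```

With only five hypothetical points the p-value is well above 0.05, so this toy sample gives little evidence against a zero slope, unlike the car price data.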
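9 Confidence Intervals for $\beta_0$ and $\beta_1$
Confidence intervals provide a range of plausible values for $\beta_0$ and $\beta_1$.
95% Confidence Intervals: $\hat\beta_0 \pm t_{n-2}(.975)\, SE(\hat\beta_0)$ and $\hat\beta_1 \pm t_{n-2}(.975)\, SE(\hat\beta_1)$.
Table A.2 lists $t_{n-2}(.975)$. It is approximately 2.
Finding CIs in JMP: Go to Parameter Estimates, right click, click Columns and then click Lower 95% and Upper 95%.
For car price data set, CIs: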
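The interval formula can be sketched as follows. The t multiplier is passed in as an argument, looked up from a t table as the slide suggests; the function name and the data are hypothetical, and the multiplier 3.182 is the tabled 97.5th percentile for 3 degrees of freedom (n = 5 here), not the value for the car price data.

```python
import math

def coefficient_intervals(x, y, t_mult):
    """Return 95% CIs ((lo, hi) for intercept, (lo, hi) for slope).

    t_mult is the table value t_{n-2}(.975), supplied by the caller.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ssr / (n - 2))
    se_b1 = sigma_hat / math.sqrt(sxx)
    se_b0 = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / sxx)
    ci_b0 = (b0 - t_mult * se_b0, b0 + t_mult * se_b0)
    ci_b1 = (b1 - t_mult * se_b1, b1 + t_mult * se_b1)
    return ci_b0, ci_b1

# Hypothetical data: n = 5, so n - 2 = 3 d.f. and the table value is 3.182.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
ci_b0, ci_b1 = coefficient_intervals(x, y, t_mult=3.182)
print(ci_b0, ci_b1)
```

For a sample of 100 cars the multiplier would be close to 2, as the slide notes, so the intervals would be roughly the estimate plus or minus two standard errors.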
10 Two prediction problems
(a) The used-car dealer has an opportunity to bid on a lot of cars offered by a rental company. The rental company has 250 Ford Tauruses, all equipped with automatic transmission, air conditioning and AM/FM cassette tape players. All of the cars in this lot have about 40,000 miles on the odometer. The dealer would like an estimate of the average selling price of all cars in this lot (or, virtually equivalently, the average selling price of the population of Ford Tauruses with the above equipment and 40,000 miles on the odometer).
(b) The used-car dealer is about to bid on a 3-year-old Ford Taurus equipped with automatic transmission, air conditioner and AM/FM cassette tape player and with 40,000 miles on the odometer. The dealer would like to predict the selling price of this particular car.
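11 Prediction problem (a)
Goal is to estimate the conditional mean of selling price given odometer reading = 40,000, $\mu\{Y|X=40{,}000\}$.
Point estimate is $\hat\mu\{Y|X=40{,}000\} = \hat\beta_0 + \hat\beta_1 \cdot 40{,}000$.
What is a range of plausible values for $\mu\{Y|X=40{,}000\}$?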
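12 Confidence Intervals for Mean of Y at X=X0
What is a plausible range of values for $\mu\{Y|X_0\}$?
95% CI for $\mu\{Y|X_0\}$: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975)\, SE[\hat\mu\{Y|X_0\}]$, where $\hat\mu\{Y|X_0\} = \hat\beta_0 + \hat\beta_1 X_0$ and $SE[\hat\mu\{Y|X_0\}] = \hat\sigma \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}$.
Note about formula: Precision in estimating $\mu\{Y|X_0\}$ is not constant for all values of X. Precision decreases as X0 gets farther away from the sample average of the X's.
JMP implementation: Use the Confid Curves Fit command under the red triangle next to Linear Fit after using Fit Y by X, Fit Line. Use the crosshair tool to find the exact values of the confidence interval endpoints for a given X0.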
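The loss of precision away from the sample average of the X's can be seen by evaluating the interval at two values of X0. This is a minimal sketch under the standard formula for the standard error of the estimated mean; the function name, the data, and the table multiplier 3.182 (3 degrees of freedom) are hypothetical choices for illustration.

```python
import math

def mean_response_ci(x, y, x0, t_mult):
    """Return (estimate, lower, upper) for the mean of Y at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_X^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ssr / (n - 2))
    mu_hat = b0 + b1 * x0
    se_mean = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return mu_hat, mu_hat - t_mult * se_mean, mu_hat + t_mult * se_mean

# Hypothetical data with sample average of x equal to 3.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

est_mid, lo_mid, hi_mid = mean_response_ci(x, y, x0=3, t_mult=3.182)
est_far, lo_far, hi_far = mean_response_ci(x, y, x0=5, t_mult=3.182)
print(hi_mid - lo_mid, hi_far - lo_far)  # the interval at x0 = 5 is wider
```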
13 Prediction Problem (b)
Goal is to estimate the selling price of a given car with odometer reading = 40,000.
What are likely values for a future value Y0 at some specified value of X (=X0)?
Best prediction is the estimated mean response for X=X0: $\hat\mu\{Y|X_0\} = \hat\beta_0 + \hat\beta_1 X_0$.
A prediction interval is an interval of likely values along with a measure of the likelihood that the interval will contain the response.
95% prediction interval for X0: If repeated samples are obtained from the subpopulations and a prediction interval is formed each time, the prediction interval will contain the value of Y0 for a future observation from the subpopulation X0 95% of the time.
14 Prediction Intervals Cont.
Prediction interval must account for two sources of uncertainty:
Uncertainty about the location of the subpopulation mean
Uncertainty about where the future value will be in relation to its mean
Prediction Error = Random Sampling Error + Estimation Error
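The decomposition can be written out as a short derivation. This is a sketch that assumes the ideal model holds and that the future observation $Y_0$ is independent of the sample used to fit the line; the notation follows the usual $\mu\{Y|X\}$ convention.

```latex
\begin{align*}
Y_0 - \hat\mu\{Y|X_0\}
  &= \underbrace{\bigl(Y_0 - \mu\{Y|X_0\}\bigr)}_{\text{random sampling error}}
   + \underbrace{\bigl(\mu\{Y|X_0\} - \hat\mu\{Y|X_0\}\bigr)}_{\text{estimation error}} \\
\operatorname{Var}\bigl(Y_0 - \hat\mu\{Y|X_0\}\bigr)
  &= \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_X^2}\right]
\end{align*}
```

The first term does not shrink as n grows, which is why a prediction interval never collapses to a point.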
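15 Prediction Interval Formula
95% prediction interval at X0: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975)\, \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}$.
Compare to 95% CI for mean at X0: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975)\, \hat\sigma \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}$.
Prediction interval is wider due to random sampling error in the future response.
As sample size n becomes large, the margin of error of the CI for the mean goes to zero but the margin of error of the PI doesn't.
JMP implementation: Use the Confid Curves Indiv command under the red triangle next to Linear Fit after using Fit Y by X, Fit Line. Use the crosshair tool to find the exact values of the prediction interval endpoints for a given X0.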
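The two margins of error can be compared numerically. The sketch below is illustrative only: the function name and data are hypothetical, the "large sample" is simply the small sample repeated 20 times, and the multipliers 3.182 (3 d.f.) and 1.984 (98 d.f.) are table values supplied by hand.

```python
import math

def margins_at(x, y, x0, t_mult):
    """Return (CI margin for the mean, PI margin for a new response) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ssr / (n - 2))
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_margin = t_mult * sigma_hat * math.sqrt(leverage)
    pi_margin = t_mult * sigma_hat * math.sqrt(1 + leverage)  # extra 1: new response
    return ci_margin, pi_margin

# Hypothetical small sample (n = 5, 3 d.f.).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
ci_small, pi_small = margins_at(x, y, x0=3, t_mult=3.182)

# Hypothetical larger sample: the same points repeated 20 times (n = 100, 98 d.f.).
ci_big, pi_big = margins_at(x * 20, y * 20, x0=3, t_mult=1.984)
print(ci_small, pi_small, ci_big, pi_big)
```

In this toy setup the CI margin shrinks by roughly a factor of nine when n goes from 5 to 100, while the PI margin stays above one unit because it is dominated by $\hat\sigma$.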
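17 R-Squared
The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable.
It is a unitless measure of the strength of the relationship between x and y.
Total sum of squares = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$: the best sum of squared prediction errors without using x.
Residual sum of squares = $\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$.
$R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$.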
18 R-Squared Example
$R^2 = 0.6501$. Read as “65.01 percent of the variation in car prices was explained by the linear regression on odometer.”
19 Interpreting $R^2$
$R^2$ takes on values between 0 and 1, with higher $R^2$ indicating a stronger linear association.
If the residuals are all zero (a perfect fit), then $R^2$ is 1. If the least squares line has slope 0, $R^2$ will be 0.
$R^2$ is useful as a unitless summary of the strength of linear association.
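The definition of R-squared and its two boundary cases can be checked with a short sketch. It assumes the usual formula (total sum of squares minus residual sum of squares, divided by total sum of squares); the function name and all three data sets are hypothetical.

```python
def r_squared(x, y):
    """Proportion of variation in y explained by the least squares line on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    total_ss = sum((yi - y_bar) ** 2 for yi in y)  # error when x is ignored
    resid_ss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return (total_ss - resid_ss) / total_ss

x = [1, 2, 3, 4, 5]

r2_typical = r_squared(x, [2, 4, 5, 4, 5])    # some scatter around the line
r2_perfect = r_squared(x, [3, 5, 7, 9, 11])   # all residuals are zero
r2_flat = r_squared(x, [1, 3, 2, 3, 1])       # least squares slope is zero
print(r2_typical, r2_perfect, r2_flat)
```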
20 Caveats about $R^2$
$R^2$ is not useful for assessing model adequacy, i.e., whether the simple linear regression model holds (use residual plots), or whether or not there is an association (use the test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$).
A good $R^2$ depends on the context. In precise laboratory work, $R^2$ values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, $R^2$ values of 50% may be considered remarkably good.
21 Association is not causation
A high $R^2$ means that x has a strong linear relationship with y: there is a strong association between x and y. It does not imply that x causes y.
Alternative explanations for high $R^2$:
The reverse is true: Y causes X.
There may be a lurking (confounding) variable related to both x and y which is the common cause of x and y.
No cause-and-effect relationship can be inferred unless X is randomly assigned to units in a randomized experiment.
A researcher measures the number of television sets per person X and the average life expectancy Y for the world's nations. The regression line has a positive slope: nations with many TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them TV sets?
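A small simulation can make the lurking-variable explanation concrete. In the sketch below, a hypothetical variable z drives both x and y, and y is generated without any reference to x; the variable names, noise levels, and seed are arbitrary choices for illustration and have nothing to do with the TV or housing data.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def correlation(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

n = 1000
z = [random.gauss(0, 1) for _ in range(n)]       # lurking variable
x = [zi + random.gauss(0, 0.5) for zi in z]      # depends on z only
y = [zi + random.gauss(0, 0.5) for zi in z]      # depends on z only, never on x

r = correlation(x, y)
r2 = r ** 2  # in simple linear regression, R-squared is the squared correlation

# Once z is removed from both variables, the association disappears.
r_given_z = correlation([xi - zi for xi, zi in zip(x, z)],
                        [yi - zi for yi, zi in zip(y, z)])
print(r, r2, r_given_z)
```

Changing x in this simulation would do nothing to y, even though a regression of y on x shows a strong linear association.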
22 Example
A community in the Philadelphia area is interested in how crime rates affect property values. If low crime rates increase property values, the community may be able to cover the costs of increased police protection by gains in tax revenues from higher property values. Data on the average housing price and crime rate (per 1000 population) for communities in Pennsylvania near Philadelphia for 1996 are shown in housecrime.JMP.
24 Questions
Can you deduce a cause-and-effect relationship from these data? What explanations are there for the association between housing prices and crime rate other than that high crime rates cause low housing prices?
Does the ideal simple linear regression model appear to hold?