Outline
Simple Linear Regression
Multiple Regression
Understanding the Regression Output
Coefficient of Determination R²
Validating the Regression Model
Linear Regression: An Example

[Table: Appleglo first-year advertising expenditures ($ millions) and first-year sales ($ millions), by region: Maine, New Hampshire, Vermont, Massachusetts, Connecticut, Rhode Island, New York, New Jersey, Pennsylvania, Delaware, Maryland, West Virginia, Virginia, Ohio]

Questions:
a) How to relate advertising expenditure to sales?
b) What is expected first-year sales if advertising expenditure is $2.2 million?
c) How confident is your estimate? How good is the “fit?”
The Basic Model: Simple Linear Regression

Data: (x1, y1), (x2, y2), …, (xn, yn)
Model of the population: Yi = β0 + β1 xi + εi
ε1, ε2, …, εn are i.i.d. random variables, ~ N(0, σ)
This is the true relation between Y and x, but we do not know β0 and β1 and have to estimate them based on the data.
Comments:
• E(Yi | xi) = β0 + β1 xi
• SD(Yi | xi) = σ
• Relationship is linear: described by a “line”
• β0 = “baseline” value of Y (i.e., value of Y if x is 0)
• β1 = “slope” of the line (average change in Y per unit change in x)
How do we choose the line that “best” fits the data?

[Scatter plot: first-year sales ($M) vs. advertising expenditures ($M), with fitted line]

Best choices: b0 = 13.82, b1 = 48.60
Regression coefficients: b0 and b1 are estimates of β0 and β1
Regression estimate for Y at xi (prediction): ŷi = b0 + b1 xi
Residual (error): ei = yi − ŷi
The “best” regression line is the one that chooses b0 and b1 to minimize the total error (residual sum of squares): Σi (yi − ŷi)²
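The least-squares choice of b0 and b1 has a closed form, which can be sketched as follows. The data below are made up for illustration (the actual Appleglo figures are not reproduced here); the slope and intercept formulas are the standard ones that minimize the residual sum of squares.

```python
import numpy as np

# Hypothetical advertising (x) and first-year sales (y) data, in $ millions.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([60.0, 115.0, 160.0, 210.0, 255.0])

# Closed-form least-squares estimates:
# b1 = sum of cross-deviations / sum of squared x-deviations, b0 from the means.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residuals of the fitted line; by construction they sum to zero.
residuals = y - (b0 + b1 * x)
```

A useful sanity check: the OLS residuals always sum to (numerically) zero when an intercept is included.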
Example: Sales of Nature-Bar ($ million)

[Table: sales, advertising, promotions, and competitor’s sales, by region: Selkirk, Susquehanna, Kittery, Acton, Finger Lakes, Berkshire, Central, Providence, Nashua, Dunster, Endicott, Five-Towns, Waldeboro, Jackson, Stowe]
Multiple Regression

• In general, there are many factors in addition to advertising expenditures that affect sales
• Multiple regression allows more than one x variable.

Independent variables: x1, x2, …, xk (k of them)
Data: (y1, x11, x21, …, xk1), …, (yn, x1n, x2n, …, xkn)
Population model: Yi = β0 + β1 x1i + … + βk xki + εi
ε1, ε2, …, εn are i.i.d. random variables, ~ N(0, σ)
Regression coefficients: b0, b1, …, bk are estimates of β0, β1, …, βk.
Regression estimate of yi: ŷi = b0 + b1 x1i + … + bk xki
Goal: choose b0, b1, …, bk to minimize the residual sum of squares, i.e., minimize: Σi (yi − ŷi)²
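The multiple-regression coefficients can be found with a single linear least-squares solve. This sketch uses synthetic data (the Nature-Bar figures themselves are not shown in the slides); the assumed “true” coefficients are chosen only to mimic the example’s signs.

```python
import numpy as np

# Synthetic data: sales regressed on advertising, promotions, and
# competitor's sales. The beta values here are illustrative assumptions.
rng = np.random.default_rng(0)
n, k = 15, 3
X = rng.normal(size=(n, k))
beta = np.array([10.0, 48.0, 60.0, -1.8])       # [intercept, b1, b2, b3]
y = beta[0] + X @ beta[1:] + rng.normal(scale=0.1, size=n)

# Append a column of ones for the intercept, then solve for the b's
# that minimize the residual sum of squares.
A = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
```

With little noise, the estimates b recover the assumed coefficients closely, illustrating that E[bj] = βj in practice.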
Understanding Regression Output

1) Regression coefficients: b0, b1, …, bk are estimates of β0, β1, …, βk based on sample data. Fact: E[bj] = βj.

Example:
b0 = (its interpretation is context dependent)
b1 = (an additional $1 million in advertising is expected to result in an additional $49 million in sales)
b2 = (an additional $1 million in promotions is expected to result in an additional $60 million in sales)
b3 = (an increase of $1 million in competitor sales is expected to decrease sales by $1.8 million)
Understanding Regression Output, Continued

2) Standard error: an estimate of σ, the SD of each εi. It is a measure of the amount of “noise” in the model.
Example: s = 17.60
3) Degrees of freedom: #cases − #parameters; relates to the over-fitting phenomenon
4) Standard errors of the coefficients: sb0, sb1, …, sbk
They are just the standard deviations of the estimates b0, b1, …, bk.
They are useful in assessing the quality of the coefficient estimates and validating the model.
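The standard error s and the residual degrees of freedom can be computed directly from the residuals; a minimal sketch with toy data (the numbers are invented, not from the Nature-Bar example):

```python
import numpy as np

# Toy simple-regression data (k = 1 independent variable).
y = np.array([5.0, 7.0, 9.0, 11.0, 13.0, 14.0])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n, k = len(y), 1

# Fit by least squares and form the residuals.
A = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ b

# Degrees of freedom = #cases - #parameters; s estimates sigma.
dof = n - (k + 1)
s = np.sqrt(np.sum(residuals ** 2) / dof)
```

Dividing by the degrees of freedom rather than n corrects for the (k + 1) parameters already estimated from the same data.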
R² takes values between 0 and 1 (it is a percentage).
R² = in our Appleglo example
R² = 0: x values account for none of the variation in the Y values
R² = 1: x values account for all of the variation in the Y values
Understanding Regression Output, Continued

5) Coefficient of determination: R²
• It is a measure of the overall quality of the regression.
• Specifically, it is the percentage of total variation exhibited in the yi data that is accounted for by the sample regression line.

The sample mean of Y: ȳ = (1/n) Σi yi
Total variation in Y = Σi (yi − ȳ)²
Residual (unaccounted) variation in Y = Σi (yi − ŷi)²

R² = (variation accounted for by x variables) / (total variation)
   = 1 − (variation not accounted for by x variables) / (total variation)
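The R² decomposition above translates directly into code. A minimal sketch, using made-up data:

```python
import numpy as np

# Toy data for illustrating the R^2 computation.
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Fit the sample regression line and form the predictions.
A = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ b

ss_total = np.sum((y - y.mean()) ** 2)   # total variation in Y
ss_resid = np.sum((y - y_hat) ** 2)      # unaccounted (residual) variation
r_squared = 1 - ss_resid / ss_total
```

Here about 60% of the variation in y is accounted for by the fitted line; the rest is residual noise.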
Coefficient of Determination: R²

• A high R² means that most of the variation we observe in the yi data can be attributed to their corresponding x values: a desired property.
• In simple regression, the R² is higher if the data points are better aligned along a line. But beware of outliers: the Anscombe example.
• How high an R² is “good” enough depends on the situation (for example, the intended use of the regression, and the complexity of the problem).
• Users of regression tend to be fixated on R², but it is not the whole story. It is important that the regression model is “valid.”
Coefficient of Determination: R²

• One should not include x variables unrelated to Y in the model just to make the R² fictitiously high. (With more x variables there will be more freedom in choosing the bi’s to make the residual variation closer to 0.)
• Multiple R is just the square root of R².
Validating the Regression Model

Assumptions about the population:
Yi = β0 + β1 x1i + … + βk xki + εi (i = 1, …, n)
ε1, ε2, …, εn are i.i.d. random variables, ~ N(0, σ)

1) Linearity
• If k = 1 (simple regression), one can check visually from the scatter plot.
• “Sanity check:” the sign of the coefficients; is there a reason for non-linearity?
2) Normality of εi
• Plot a histogram of the residuals
• Usually, results are fairly robust with respect to this assumption.
3) Heteroscedasticity
• Do error terms have constant standard deviation? (i.e., SD(εi) = σ for all i?)
• Check scatter plot of residuals vs. Y and x variables.

[Scatter plots: residuals vs. advertising expenditures; one panel shows no evidence of heteroscedasticity, the other shows evidence of heteroscedasticity]

• May be fixed by introducing a transformation
• May be fixed by introducing or eliminating some independent variables
4) Autocorrelation: Are error terms independent?
Plot residuals in order and check for patterns.

[Time plots of residuals: one panel shows no evidence of autocorrelation, the other shows evidence of autocorrelation]

• Autocorrelation may be present if observations have a natural sequential order (for example, time).
• May be fixed by introducing a variable or transforming a variable.
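Beyond eyeballing the time plot, a quick numerical check is the lag-1 correlation of the residuals (residual i vs. residual i+1). The residual values below are invented for illustration; this is a rough screen, not a formal test such as Durbin-Watson.

```python
import numpy as np

# Hypothetical residuals in their natural (time) order. These alternate
# in sign, a classic pattern of negative autocorrelation.
resid = np.array([1.2, -0.8, 0.5, -1.1, 0.9, -0.4, 0.7, -0.6])

# Correlation between each residual and the next one in sequence.
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
```

A value of r1 near 0 is consistent with independent errors; a value near +1 or -1 suggests autocorrelation worth investigating.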
Pitfalls and Issues

1) Overspecification
• Including too many x variables to make R² fictitiously high.
• Rule of thumb: we should maintain n ≥ 5(k + 2).
2) Extrapolating beyond the range of data

[Scatter plot: sales vs. advertising, showing the danger of extrapolating the fitted line beyond the observed range of the data]
Validating the Regression Model

3) Multicollinearity
• Occurs when two of the x variables are strongly correlated.
• Can give very wrong estimates for the βi’s.
• Tell-tale signs:
  - Regression coefficients (bi’s) have the “wrong” sign.
  - Addition/deletion of an independent variable results in large changes of regression coefficients.
  - Regression coefficients (bi’s) are not significantly different from 0.
• May be fixed by deleting one or more independent variables
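A simple diagnostic is the correlation matrix of the x variables: any pairwise correlation near +/-1 flags candidates for deletion. The data below are constructed so that x2 is nearly a linear function of x1 (all values are synthetic).

```python
import numpy as np

# Three candidate x variables; x2 is deliberately near-collinear with x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=50)   # almost a multiple of x1
x3 = rng.normal(size=50)

# Pairwise correlations among the x variables.
corr = np.corrcoef(np.vstack([x1, x2, x3]))
# corr[0, 1] will be close to 1, flagging x1 and x2 as a collinear pair.
```

In this situation one would drop either x1 or x2 and re-run the regression, then check whether the remaining coefficients stabilize.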
Regression Output: What Happened?

[Regression outputs for graduate salary: with both College GPA and GMAT included, the coefficient estimates and their standard errors are unreliable; after eliminating GMAT, the intercept and College GPA coefficient are estimated cleanly. College GPA and GMAT are highly correlated!]
Regression Models

• In linear regression, we choose the “best” coefficients b0, b1, …, bk as the estimates for β0, β1, …, βk.
• We know that on average each bj hits the right target βj.
• However, we also want to know how confident we are about our estimates.
Back to Regression Output

[Regression output table: regression statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations); analysis of variance (df, sum of squares, mean square for regression, residual, and total); and, for each of Intercept, Advertising, Promotions, and Competitor’s Sales: coefficient, standard error, t-statistic, p-value, and lower/upper 95% limits]
Regression Output Analysis

1) Degrees of freedom (dof)
• Residual dof = n − (k + 1). (We used up (k + 1) degrees of freedom in forming the (k + 1) sample estimates b0, b1, …, bk.)
2) Standard errors of the coefficients: sb0, sb1, …, sbk
• They are just the SDs of the estimates b0, b1, …, bk.
• Fact: Before we observe bj and sbj, the quantity (bj − βj) / sbj obeys a t-distribution with dof = (n − k − 1), the same dof as the residual.
• We will use this fact to assess the quality of our estimates bj.
• What is a 95% confidence interval for βj?
• Does the interval contain 0? Why do we care about this?
3) t-Statistic: tj = bj / sbj
• A measure of the statistical significance of each individual xj in accounting for the variability in Y.
• Let c be that number for which P(−c ≤ T ≤ c) = α%, where T obeys a t-distribution with dof = (n − k − 1).
• If |tj| > c, then the α% C.I. for βj does not contain zero.
• In this case, we are α% confident that βj is different from zero.
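The standard errors sbj and the t-statistics can be computed from the usual OLS formula se² = s² · diag((AᵀA)⁻¹), where A is the design matrix with an intercept column. The data are simulated; the second x variable is given a true coefficient of 0 so its t-statistic should be small, while x1's should be large.

```python
import numpy as np

# Simulated data: y depends on x1 (coefficient 3) but not on x2.
rng = np.random.default_rng(2)
n, k = 30, 2
X = rng.normal(size=(n, k))
y = 5.0 + 3.0 * X[:, 0] + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ b

# s^2 with the residual degrees of freedom, then the coefficient SEs.
s2 = np.sum(resid ** 2) / (n - k - 1)
se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))
t_stats = b / se
```

Comparing each |tj| with the t critical value c for dof = n − k − 1 (about 2.05 here at the 95% level) reproduces the decision rule on the slide.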
Example: Executive Compensation

[Table: for each executive, pay ($1,000), years in position, change in stock price (%), change in sales (%), and MBA (yes/no)]
Dummy Variables

• Often, some of the explanatory variables in a regression are categorical rather than numeric.
• If we think whether an executive has an MBA or not affects his/her pay, we create a dummy variable and let it be 1 if the executive has an MBA and 0 otherwise.
• If we think season of the year is an important factor in determining sales, how do we create dummy variables? How many?
• What is the problem with creating 4 dummy variables?
• In general, if there are m categories an x variable can belong to, then we need to create m − 1 dummy variables for it.
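The season example can be sketched as follows: with m = 4 categories we create m − 1 = 3 dummy columns, leaving one category (here winter, an arbitrary choice) as the baseline absorbed into the intercept. Creating all 4 would make the dummies sum to the intercept column, which is exactly the multicollinearity problem from the previous slides.

```python
# Encode a 4-category season variable with m - 1 = 3 dummies.
# "winter" is the omitted baseline category.
seasons = ["spring", "summer", "fall", "winter", "summer"]
categories = ["spring", "summer", "fall"]          # winter = baseline

dummies = [[1 if s == c else 0 for c in categories] for s in seasons]
# A winter observation gets the all-zeros row [0, 0, 0]; its average
# effect is carried by the regression intercept.
```

Each non-baseline coefficient is then interpreted as the average difference in Y between that season and winter.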
[Scatter plot: heating oil consumption (1,000 gallons) vs. average temperature (degrees Fahrenheit)]
[Plots: heating oil consumption vs. temperature, and heating oil consumption vs. inverse temperature]
The Practice of Regression

• Choose which independent variables to include in the model, based on common sense and context-specific knowledge.
• Collect data (create dummy variables if necessary).
• Run the regression: the easy part.
• Analyze the output and make changes in the model: this is where the action is.
• Test the regression result on “out-of-sample” data.
The Post-Regression Checklist

1) Statistics checklist:
Calculate the correlation between pairs of x variables: watch for evidence of multicollinearity
Check signs of coefficients: do they make sense?
Check 95% C.I.s (use t-statistics as a quick scan): are coefficients significantly different from zero?
R²: overall quality of the regression, but not the only measure
2) Residual checklist:
Normality: look at a histogram of the residuals
Heteroscedasticity: plot residuals against each x variable
Autocorrelation: if the data have a natural order, plot residuals in order and check for a pattern
The Grand Checklist

• Linearity: scatter plot, common sense, and knowing your problem; transform, including interactions, if useful
• t-statistics: are the coefficients significantly different from zero? Look at the width of the confidence intervals
• F-tests for subsets; equality-of-coefficients tests
• R²: is it reasonably high in the context?
• Influential observations; outliers in predictor space and dependent-variable space
• Normality: plot a histogram of the residuals
• Studentized residuals
• Heteroscedasticity: plot residuals against each x variable; transform if necessary (Box-Cox transformations)
• Autocorrelation: “time series plot”
• Multicollinearity: compute correlations of the x variables; do signs of coefficients agree with intuition?
• Principal components
• Missing values