
I. Introduction: Simple Linear Regression

 As discussed last semester, what are the basic differences between correlation & regression?  What vulnerabilities do correlation & regression share in common?  What are the conceptual challenges regarding causality?

 Linear regression is a statistical method for examining how an outcome variable y depends on one or more explanatory variables x.  E.g., what is the relationship of the per capita earnings of households to their numbers of members & their members’ ages, years of higher education, race-ethnicity, gender & employment statuses?

 What is the relationship of the fertility rates of countries to their levels of GDP per capita, urbanization, education, & so on?  Linear regression is used extensively in the social, policy, & other sciences.

Multiple regression—i.e. linear regression with more than one explanatory variable—makes it possible to:  Combine many explanatory variables for optimal understanding &/or prediction; &  Examine the unique contribution of each explanatory variable, holding the levels of the other variables constant.

 Hence, multiple regression enables us to perform, in a setting of observational research, a rough approximation to experimental analysis.  Why, though, is experimental control better than statistical control?  So, to some degree multiple regression enables us to isolate the independent relationships of particular explanatory variables with an outcome variable.

So, concerning the relationship of the per capita earnings of households to their numbers of members & their members’ ages, years of education, race-ethnicity, gender & employment statuses:  What is the independent effect of years of education on per capita household earnings, holding the other variables constant?

 Regression is linear because it’s based on a linear (i.e. straight line) equation.  E.g., for every one-year increase in a family member’s higher education (an explanatory variable), household per capita earnings increase by $3127 on average, holding the other variables fixed.

 But such a statistical finding raises questions: e.g., is a year of college equivalent to a year of graduate school with regard to household earnings?  We’ll see that multiple regression can accommodate nonlinear as well as linear y/x relationships.  And again, always question whether the relationship is causal.

Before proceeding, let’s do a brief review of basic statistics.  A variable is a feature that differs from one observation (i.e. individual or subject) to another.

 What are the basic kinds of variables?  How do we describe them in, first, univariate terms, & second, bivariate terms?  Why do we need to describe them both graphically & numerically?

 What’s the fundamental problem with the mean as a measure of central tendency & standard deviation as a measure of spread? When should we use them?  Despite their problems, why are the mean & standard deviation used so commonly?

 What’s a density curve? A normal distribution? What statistics describe a normal distribution? Why is it important?  What’s a standard normal distribution? What does it mean to standardize a variable, & how is it done?  Are all symmetric distributions normal?

 What’s a population? A sample? What’s a parameter? A statistic? What are the two basic probability problems of samples, & how most basically do we try to mitigate them?  Why is a sample mean typically used to estimate a parameter? What’s an expected value?

 What’s sampling variability? A sampling distribution? A population distribution?  What’s the sampling distribution of a sample mean? The law of large numbers? The central limit theorem?  Why’s the central limit theorem crucial to inferential statistics?

 What’s the difference between a standard deviation & a standard error? How do their formulas differ?  What’s the difference between the z- & t-distributions? Why do we typically use the latter?

 What’s a confidence interval? What’s its purpose? Its premises, formula, interpretation, & problems? How do we make it narrower?  What’s a hypothesis test? What’s its purpose? Its premise & general formula? How is it stated? What’s its interpretation?

 What are the typical standards for judging statistical significance? To what extent are they defensible or not?  What’s the difference between statistical & practical significance?

 What are Type I & Type II errors? What is the Bonferroni (or other such) adjustment?  What are the possible reasons for a finding of statistical insignificance?

True or false, & why:  Large samples are bad.  To obtain roughly equal variability, we must take a much bigger sample in a big city than in a small city.  You have data for an entire population. Next step: construct confidence intervals & conduct hypothesis tests for the variables. Source: Freedman et al., Statistics.

(true-false continued)  To fulfill the statistical assumptions of correlation or regression, what definitively matters for each variable is that its univariate distribution is linear & normal. __________________________

Define the following:  Association  Causation  Lurking variables  Simpson’s Paradox  Spurious non-association  Ecological correlation

 Restricted-range data  Non-sampling errors _________________________

 Regarding variables, ask:  How are they defined & measured?  In what ways are their definition & measurement valid or not?  What are the implications of the above for the social construction of reality?  See King et al., Designing Social Inquiry; & Ragin, Constructing Social Research.

Remember the following, overarching principles concerning statistics & social/policy research from last semester’s course: (1) Anecdotal versus systematic evidence (including the importance of theories in guiding research). (2) Social construction of reality.

(3) Experimental versus observational evidence. (4) Beware of lurking variables. (5) Variability is everywhere. (6) All conclusions are uncertain.

 Recall the relative strengths & weaknesses of large-n, multivariate quantitative research versus small-n, comparative research & case-study research.  “Not everything worthwhile can be measured, and not everything measured is worthwhile.” Albert Einstein

 And always question presumed notions of causality.

Finally, here are some more or less equivalent terms for variables:  e.g., dependent, outcome, response, criterion, left-hand side  e.g., independent, explanatory, predictor, regressor, control, right-hand side __________________________

Let’s return to the topic of linear regression.

 The dean of students wants to predict the grades of all students at the end of their freshman year. After taking a random sample, she could use the following equation: y = β₀ + ε, where β₀ is the mean freshman GPA & ε is a random error term.

 Since the dean doesn’t know the value of the random error for a particular student, this equation could be reduced to using the sample mean of freshman GPA to estimate a particular student’s GPA: yhat = ȳ. That is, a student’s predicted y (i.e. yhat) is estimated as equal to the sample mean of y.
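
 A minimal Stata sketch of this mini-model, using the hsb2 dataset that appears later in these slides as a stand-in for the dean’s GPA data (an illustrative assumption only): a regression with no explanatory variables reproduces the sample mean as the prediction.
. use hsb2, clear
. quietly summarize science
. display "sample mean of science = " r(mean)
. regress science        // constant-only model: _cons equals the sample mean of y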

 But what does that mini-model overlook?  That a more accurate model—& thus more precise predictions—can be obtained by using explanatory variables (e.g., SAT score, major, hours of study, gender, social class, race-ethnicity) to estimate freshman GPA.

 Here we see a major advantage of regression versus correlation: regression permits y/x directionality* (including multiple explanatory variables).  In addition, regression coefficients are expressed in the units in which the variables are measured. * Recall from last semester: What are the ‘two regression lines’? What questions are raised about causality?

 We use a six-step procedure to create a regression model (as defined in a moment): (1) Hypothesize the form of the model for E(y). (2) Collect the sample data on outcome variable y & one or more explanatory variables x: a random sample, with data on all the regression variables collected for the same subjects. (3) Use the sample data to estimate the unknown parameters in the model.

(4) Specify the probability distribution of the random error term (i.e. the variation in outcome variable y that the model leaves unexplained), & estimate any unknown parameters of this distribution. (5) Statistically check the usefulness of the model. (6) When satisfied that the model is useful, use it for prediction, estimation, & so on.

 We’ll be following this six-step procedure for building regression models throughout the semester.  Our emphasis, then, will be on how to build useful models: i.e. useful sets of explanatory variables x’s and forms of their relationship to outcome variable y.

 “A model is a simplification of, and approximation to, some aspect of the world. Models are never literally ‘true’ or ‘false,’ although good models abstract only the ‘right’ features of the reality they represent” (King et al., Designing Social Inquiry, page 49).  Models both reflect & shape the social construction of reality.

 We’ll focus, then, on modeling: trying to describe how sets of explanatory variables x’s are related to outcome variable y.  Integral to this focus will be an emphasis on the interconnections of theory & empirical research (including questions of causality).

 We’ll be thinking about how theory informs empirical research, & vice versa.  See King et al., Designing Social Inquiry; Ragin, Constructing Social Research; McClendon, Multiple Regression and Causal Analysis; Berk, Regression: A Constructive Critique.

 “A social science theory is a reasoned and precise speculation about the answer to a research question, including a statement about why the proposed answer is correct.”  “Theories usually imply several or more specific descriptive or causal hypotheses” (King et al., page 19).  And to repeat: A model is “a simplification of, and approximation to, some aspect of reality” (King et al., page 49).

 One more item before we delve into regression analysis: Regarding graphic assessment of the variables, keep the following points in mind:  Use graphs to check distributions & outliers before describing or estimating variables & models; & after estimating models as well.

 The univariate distributions of the variables for regression analysis need not be normal!  But the usual caveats concerning extreme outliers must be heeded.  It’s not the univariate graphs but the y/x bivariate scatterplots that provide the key evidence on these concerns.

Even so, let’s anticipate a fundamental feature of multiple regression:  The characteristics of bivariate scatterplots & correlations do not necessarily predict whether explanatory variables will be significant or not in a multiple regression model.

 Moreover, bivariate relationships don’t necessarily indicate whether a Y/X relationship will be positive or negative within a multivariate framework.

 This is because multiple regression expresses the joint, linear effects of a set of explanatory variables on an outcome variable.  See Agresti/Finlay, chapter 10; and McClendon, chapter 1 (and other chapters).

 Let’s start our examination of regression analysis, however, with a simple (i.e. one explanatory variable) regression model:

. su science math
. corr science math
. scatter science math || qfit science math

. reg science math
[Stata regression output: ANOVA table (Model, Residual, Total: SS, df, MS); Number of obs = 200; F(1, 198); Prob > F; R-squared; Adj R-squared; Root MSE; coefficient table for math & _cons (Coef., Std. Err., t, P>|t|, 95% Conf. Interval). Numeric values were not preserved in this transcript.]
 Interpretation?

 For every one-unit increase in x, y increases (or decreases) by … units, on average.  For every one-unit increase in math score, science score increases by 0.67, on average.  Questions of causal order?

 What’s the standard deviation interpretation, based on the formulation for b, the regression coefficient?
. su science math
. corr science math
 Or easier:
. listcoef, help

regress (N=200): Unstandardized and Standardized Estimates
[listcoef output: Observed SD & SD of Error for science; a row for math with b, t, P>|t|, bStdX, bStdY, bStdXY & SDofX; numeric values not preserved in this transcript]
b = raw coefficient
t = t-score for test of b=0
P>|t| = p-value for t-test
bStdX = x-standardized coefficient
bStdY = y-standardized coefficient
bStdXY = fully standardized coefficient
SDofX = standard deviation of X
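
 As a check on listcoef’s fully standardized coefficient, here is a minimal sketch (assuming the science-on-math regression; the scalar names sdx & sdy are illustrative) that computes it by hand as b*SD(x)/SD(y):
. quietly regress science math
. quietly summarize math
. scalar sdx = r(sd)
. quietly summarize science
. scalar sdy = r(sd)
. display "fully standardized b = " (_b[math]*sdx/sdy)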

 What would happen if we reversed the equation?

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 With science as the outcome variable.

. reg math science
[regression output: math regressed on science; numeric values not preserved in the transcript]
 With math as the outcome variable.

 What would be risky in saying that ‘every one-unit increase in math scores causes a 0.67 increase in predicted science score’?

 Because: (1) Computer software will accept variables in any order & churn out regression y/x results—even if the order makes no sense. (2) Association does not necessarily signify causation. (3) Beware of lurking variables. (4) There’s always the likelihood of non-sampling error. (5) It’s much easier to disprove than prove causation.  So be cautious!

 See McClendon (pp. 4-7) on issues of causal inference.  How do we establish causality?  Can regression analysis be worthwhile even if causality is ambiguous?  See also Berk, Regression Analysis: A Constructive Critique.

 Why is a regression model probabilistic rather than deterministic?

 Because the model is estimated from sample data & thus will include some variation due to random phenomena that can’t be modeled or explained.  That is, the random error component represents all unexplained variation in outcome variable y caused by important but omitted variables or by unexplainable random phenomena.  Examples of a random error component for this model (i.e. using science scores to predict math scores)?

 There are three basic sources of error in regression analysis: (1) Sampling error (2) Measurement error (including non-sampling error) (3) Omitted variables See Allison, Multiple Regression: A Primer.

 Examine the type & quality of the sample.  Based on your knowledge of the topic: What variables are relevant? How should they be defined & measured? How actually are they defined & measured?  Examine the diagnostics for the model’s residuals (i.e. probabilistic, or ‘error’, component).

 After estimating a regression equation, we estimate the value of e associated with each y value using the corresponding residual, i.e. the deviation between the observed & predicted value of y.

 The model’s random error component consists of deviations between the observed & predicted values of y. These are the residuals (which, to repeat, are estimates of the model’s error component for each value of y).  Each observed science score minus each predicted science score.

. reg science math
[regression output as above; numeric values not preserved in the transcript]

. predict yhat [predicted values of y] (option xb assumed; fitted values)
. predict e, resid [residuals]
. sort science [to order its values from lowest to highest]
. su science yhat e
. list science yhat e in 1/10
. list science yhat e in 100/110
. list science yhat e in -10/l (‘l’ indicates ‘last’)

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 SS Residual (i.e. SSE) = the Residual entry in the SS column of the top-left table.

 The least squares line, or regression line, has two properties: (1) The sum of the errors (i.e. deviations or residuals), SE, equals 0. (2) The sum of the squared errors, SSE, is smaller than for any other straight-line model with SE = 0.

 The regression line is called the least squares line because it minimizes the distance between the equation’s y-predictions & the data’s y-observations (i.e. it minimizes the sum of squared errors, SSE).  The better the model fits the data, the smaller the distance between the y-predictions & the y-observations.
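
 A minimal sketch of property (1) for the science/math example (the variable name e1 is illustrative): the least squares residuals sum to essentially zero.
. quietly regress science math
. predict double e1, resid       // residuals
. quietly summarize e1
. display "sum of residuals = " r(sum)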

 Here are the values of the regression model’s estimated beta (i.e. slope or regression) coefficient & y-intercept (i.e. constant) that minimize SSE:

b₁ = SSxy/SSxx   &   b₀ = ȳ − b₁x̄

where:

SSxy = Σ(xᵢ − x̄)(yᵢ − ȳ)   &   SSxx = Σ(xᵢ − x̄)²
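
 A minimal sketch that applies these formulas to the science/math example, assuming the hsb2 data are in memory (the variable & scalar names are illustrative):
. quietly summarize math
. scalar xbar = r(mean)
. quietly summarize science
. scalar ybar = r(mean)
. generate double dxy = (math - xbar)*(science - ybar)
. generate double dxx = (math - xbar)^2
. quietly summarize dxy
. scalar SSxy = r(sum)
. quietly summarize dxx
. scalar SSxx = r(sum)
. scalar b1 = SSxy/SSxx          // slope coefficient
. scalar b0 = ybar - b1*xbar     // y-intercept
. display "b1 = " b1 "   b0 = " b0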

 Compute the y-intercept:
. su science math
[summary output: Obs, Mean, Std. Dev., Min & Max for science & math; most numeric values not preserved in this transcript]
. display 51.85 - (.66658*52.645)   // ≈ 16.76
Note: math slope coefficient = .66658 (see regression output); 51.85 & 52.645 are the sample means of science & math.

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 The y-intercept (i.e. the constant, _cons) reported by Stata matches our hand calculation.

 Compute math’s slope coefficient: the sum of the products of each math deviation from its mean times the corresponding science deviation from its mean, divided by the sum of the squared math deviations (i.e. b₁ = SSxy/SSxx).

. reg science math
[regression output as above; numeric values not preserved in the transcript]

 We’ll eventually see that the probability distribution of e determines how well the model describes the population relationship between outcome variable y & explanatory variable x.  In this context, there are four basic assumptions about the probability distribution of e.  These are important to (1) minimize bias & (2) to make confidence intervals & hypothesis tests valid.

The Four Assumptions (1) The expected value of e over all possible samples is 0. That is, the mean of e does not vary with the levels of x. (2) The variance of the probability distribution of e is constant for all levels of x. That is, the variance of e does not vary with the levels of x.

(3) The errors associated with any two different y observations are independent. That is, the errors are uncorrelated: the errors associated with one value of y have no effect on the errors associated with other y values. (4) The probability distribution of e is normal.

 These assumptions of the regression model are commonly summarized as: I.I.D.  Independently & identically distributed errors.

 As we’ll come to understand, the assumptions make the estimated least squares line an unbiased estimator of the population y-intercept & slope coefficient—i.e. of the population regression line, the mean of y at each level of x.

 Plus they make the standard errors of the estimated least squares line as small as possible & unbiased, so that confidence intervals & hypothesis tests are valid.

 Checking these vital assumptions— which need not hold exactly—is a basic part of post-estimation diagnostics.
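
 A minimal sketch of such post-estimation checks for the science/math model (one of several possible diagnostic routines; the variable name rchk is illustrative):
. quietly regress science math
. rvfplot, yline(0)      // residuals vs. fitted values: check the mean-zero & constant-variance assumptions
. predict double rchk, resid
. qnorm rchk             // normal quantile plot of the residuals: check approximate normality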

 How do we estimate the variability of the random error e (i.e. the variability of outcome variable y around its predicted values)?  We do so by estimating the variance of e (i.e. the variance of y around the regression line).

 Why must we be concerned with the variance of e?  Because the greater the variance of e, the greater will be the errors in the estimates of the y-intercept & slope coefficient.

 Thus the greater the variance of e, the more inaccurate will be the predicted value of y for any given value of x.

 Since we don’t know the population error variance, σ², we estimate it with sample data as follows:

s² = SSE/df(error) = Σ(each observed science score − its predicted science score)²/(n − 2)

 Standard error of e: s = √(s²) = √[SSE/(n − 2)], which the regression output reports as the Root MSE.
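
 A minimal sketch computing SSE, s² & s by hand for the science/math regression (the variable & scalar names are illustrative):
. quietly regress science math
. predict double ehat, resid
. generate double ehat_sq = ehat^2
. quietly summarize ehat_sq
. scalar SSE = r(sum)
. scalar s2 = SSE/(e(N) - 2)     // df for error = n - 2 in simple regression
. scalar s = sqrt(s2)            // should match the reported Root MSE
. display "SSE = " SSE "   s2 = " s2 "   s = " s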

 Interpretation of s (the standard error of the estimate, here 7.70): roughly 95% of the observed y values fall within about +/- 2*7.70 (i.e. +/- two standard deviations) of their predicted values.  To display other confidence levels for this & the other regression output in STATA: reg y x1 x2, level(90)

 Assessing the usefulness of the regression model: making inferences about the slope β₁. Ho: β₁ = 0. Ha: β₁ ≠ 0 (or a one-tailed Ha in either direction).

. reg science math
[regression output as above; numeric values not preserved in the transcript]
t = .66658 / (the slope’s standard error) = 11.44; the p-value is reported in the output.
 Hypothesis test & conclusion?

 Depending on the selected alpha (i.e. test criterion) & on the test’s p-value, either reject or fail to reject Ho.  The hypothesis test’s assumptions: probability sample; & the previously discussed four assumptions about e.

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 How to compute a slope coefficient’s confidence interval?

 Compute math’s slope-coefficient confidence interval (.95):
. di invttail(199, .05/2)   [t for a .95 confidence interval ≈ 1.972]
. di .66658 - (1.972*<std. err.>) = low side of CI
. di .66658 + (1.972*<std. err.>) = high side of CI
Note: math slope coefficient = .66658; its standard error appears in the regression output (the value was not preserved in this transcript).
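
 Equivalently, a minimal sketch that builds the confidence interval from the results regress stores, so no numbers have to be retyped (the scalar names are illustrative):
. quietly regress science math
. scalar bmath = _b[math]
. scalar semath = _se[math]
. scalar tcrit = invttail(e(df_r), .05/2)
. display "95% CI for math slope: " (bmath - tcrit*semath) " to " (bmath + tcrit*semath)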

Conclusion for confidence interval:  We can say with 90% or 95% or 99% confidence that for every one- unit increase/decrease in x, y changes by +/- …… units, on average.  But remember: there are non- sampling sources of error, too.

 Let’s next discuss correlation.

 Correlation: a linear relationship between two quantitative variables (though recall from last semester that ‘spearman’ & other such procedures compute correlation involving categorical variables, or when assumptions for correlation between two quantitative variables are violated).  Beware of outliers & non-linearity: graph a bivariate scatterplot in order to conclude whether conducting a correlation test makes sense or not (& thus whether an alternative measure should be used).

 Correlation assesses the degree of bivariate cluster along a straight line: the strength of a linear relationship.  Regression examines the degree of y/x slope of a straight line: the extent to which y varies in response to changes in x.

 Regarding correlation, remember that association does not necessarily imply causation.  And beware of lurking variables.  Other limitations of correlation analysis?

Formula for correlation coefficient:  Standardize each x observation & each y observation.  Cross-multiply each pair of x & y observations.  Divide the sum of the cross-products by n – 1.

 In short, the correlation coefficient is the average of the cross-products of the standardized x & y values.  Here’s the equivalent, sum-of-squares formula: r = SSxy/√(SSxx·SSyy) = Σ(x − x̄)(y − ȳ)/√[Σ(x − x̄)²·Σ(y − ȳ)²]
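
 A minimal sketch of the ‘average cross-product of z-scores’ version of r for science & math (the variable names zsci, zmath & zprod are illustrative):
. egen zsci = std(science)
. egen zmath = std(math)
. generate double zprod = zsci*zmath
. quietly summarize zprod
. display "r = " (r(sum)/(r(N) - 1))      // should match -corr science math-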

 Hypothesis test for correlation: Ho: ρ = 0; Ha: ρ ≠ 0 (or a one-sided Ha in either direction).  Depending on the selected alpha & on the test’s p-value, either reject or fail to reject Ho.  The hypothesis test’s assumptions?

 Before estimating a correlation, of course, first graph the univariate & bivariate distributions.  Look for overall patterns & striking deviations, especially outliers.  Is the bivariate scatterplot approximately linear? Are there extreme outliers?

. hist science, norm

. hist math, norm

. scatter science math  Approximately linear, no extreme outliers.

. scatter science math || lfit science math

. scatter science math || qfit science math

 Hypothesis test: Ho: ρ = 0; Ha: ρ ≠ 0.

. pwcorr science math, sig star(.05)
[correlation matrix output: the science–math correlation is starred as significant at the .05 level; numeric values not preserved in this transcript]
 Hypothesis test conclusion?

Coefficient of determination, r²:  r² (which in simple, though not multiple, regression is just the squared correlation coefficient) represents the proportion of the sum of squared deviations of the y values about their mean that can be attributed to the linear relationship between y & x.  Interpretation: about 100(r²)% of the sample variation in y can be attributed to the use of x to predict y in the straight-line model.  Higher r² signifies better fit: greater clustering along the y/x straight line.

 Formula for r² in simple & multiple regression: r² = SS(Model)/SS(Total) = 1 − SSE/SST.  How would this be computed for the regression of science on math?
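
 A minimal sketch of the equivalence in simple regression, using the results stored by corr & regress:
. quietly correlate science math
. display "r2 from the squared correlation = " (r(rho)^2)
. quietly regress science math
. display "r2 from the sums of squares     = " (e(mss)/(e(mss) + e(rss)))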

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 r² = Model SS/Total SS = 0.3978

 Let’s step back for a moment & review the matter of explained versus unexplained variation in an estimated regression model. DATA = FIT + RESIDUAL  What does this mean? Why does it matter?

 DATA: total variation in outcome variable y; measured by the total sum of squares.  FIT: variation in outcome variable y attributed to the explanatory variable x (i.e. by the model); measured by the model sum of squares.  RESIDUAL: variation in outcome variable y attributed to the estimated errors; measured by the residual (or error) sum of squares.

DATA = FIT + RESIDUAL
SST = SSM + SSE

 Sum of Squares Total (SST): take each observed y minus the mean of y; square each deviation; sum the squared values.  Sum of Squares for Model (SSM): take each predicted y minus the mean of y; square each deviation; sum the squared values.  Sum of Squares for Errors (SSE): take each observed y minus its predicted y; square each deviation; sum the squared values.
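
 A minimal sketch that computes the three sums of squares by hand for the science/math model & verifies that they add up (the variable & scalar names are illustrative):
. quietly regress science math
. predict double fit, xb
. quietly summarize science
. scalar ybar = r(mean)
. generate double dT = (science - ybar)^2
. generate double dM = (fit - ybar)^2
. generate double dE = (science - fit)^2
. quietly summarize dT
. scalar SST = r(sum)
. quietly summarize dM
. scalar SSM = r(sum)
. quietly summarize dE
. scalar SSE = r(sum)
. display "SST = " SST "   SSM + SSE = " (SSM + SSE)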

. reg science math
[regression output as above; numeric values not preserved in the transcript]

 Next step: compute the variance for each component by dividing its sum of squares by its degrees of freedom—its Mean Square: Mean Square for Total, Mean Square for Model & Mean Square for Errors (Residuals). (Note that the sums of squares add up, but the mean squares do not.)  s²: Mean Square for Errors (Residuals)  s: Root MSE (the standard error of the estimate)

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 Root MSE = the square root of the Mean Square for Residuals.

 Analysis of Variance (ANOVA) Table: the regression output displaying the sums of squares & mean squares for model, residual (error) & total.  How do we compute F & r² from the ANOVA table? F = Mean Square Model/Mean Square Residual; r² = Sum of Squares Model/Sum of Squares Total.
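
 A minimal sketch of both computations, drawing the ANOVA quantities from what regress stores (the scalar names are illustrative):
. quietly regress science math
. scalar MSM = e(mss)/e(df_m)    // Mean Square for Model
. scalar MSR = e(rss)/e(df_r)    // Mean Square for Residuals (= s^2)
. display "F  = MSM/MSR = " (MSM/MSR)
. display "r2 = SSM/SST = " (e(mss)/(e(mss) + e(rss)))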

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 F = MSM/MSR (value not preserved in the transcript)
 r² = SS Model/SS Total = 0.3978


Using the regression model for estimation & prediction:  Fundamental point: never make predictions beyond the range of the sampled (i.e. observed) x values.  That is, while the model may provide a good fit for the sampled range of values, it could give a poor fit outside the sampled x-value range.

 Another point in making predictions: the standard error for the estimated mean of y will be less than that for an estimated individual y observation.  That is, there’s more uncertainty in predicting individual y values than mean y values.  Why is this so?
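
 A minimal sketch contrasting the two standard errors after the science/math regression (the variable names se_mean & se_indiv are illustrative):
. quietly regress science math
. predict double se_mean, stdp     // standard error of the estimated mean of y at each x
. predict double se_indiv, stdf    // standard error of an individual forecast: always larger
. list math se_mean se_indiv in 1/5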

 Let’s review how STATA reports the indicators of how a regression model fits the sampled data.

. reg science math
[regression output as above; numeric values not preserved in the transcript]

Software regression output typically refers to the residual terms more or less as follows:  s² = Mean Square Error (MSE: the residual sum of squares divided by its d.f., i.e. the estimated variance of y around its predicted values)  s = Root Mean Square Error (Root MSE: the standard error of the estimate, which equals the square root of MSE)

Stata labels the Residuals, Mean Square Error & Root Mean Square Error as follows: Top-left table  SS for Residual: the sum of squared errors (SSE)  MS for Residual: SSE divided by its d.f., i.e. the estimated error variance s². Top-right column  Root MSE: the standard error of the estimate, s.  & moreover there’s R² (as well as F & other indicators that we’ll examine next week).

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 SS Residual/df Residual = MS Residual: the estimated error variance, s² (the variance of y around its predicted values). Root MSE = sqrt(MS Residual): the standard error of the estimate, s.

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 r² = SS Model/SS Total

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 F = MSM/MSR

 The most basic ways to make a linear prediction of y (i.e. yhat) after estimating a simple regression model:
. display <constant> + <math coefficient>
. display <constant> + <math coefficient>*45
. lincom _cons + math
. lincom _cons + math*45
(lincom: linear combination; provides a confidence interval for the prediction. The numeric values typed into the display commands were not preserved in this transcript.)
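
 As an alternative (assuming a reasonably recent version of Stata), the margins command gives the same kind of point prediction & confidence interval as lincom, e.g. at math = 45:
. quietly regress science math
. margins, at(math = 45)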

 In summary, we use a six-step procedure to create a regression model: (1) Hypothesize the form of the model for E(y). (2) Collect the sample data: a random sample, with data for the regression variables collected on the same subjects. (3) Use the sample data to estimate the unknown parameters in the model.

(4) Specify the probability distribution of the random error term, & estimate any unknown parameters of this distribution. (5) Statistically check the usefulness of the model. (6) When satisfied that the model is useful, use it for prediction, estimation, & so on.  See King et al.

 Finally, the four fundamental assumptions of regression analysis involve the probability distribution of e (the model’s random component, which consists of the residuals).  These assumptions can be summarized as I.I.D.

 The univariate distributions of the variables for regression analysis need not be normal!  But the usual caveats concerning extreme outliers are important.  It’s not the univariate graphs but the y/x bivariate scatterplots that provide the key evidence on these concerns.

 We’ll nonetheless see that the characteristics of bivariate relationships do not necessarily predict whether explanatory variables will test significant or the direction of their coefficients in a multiple regression model.  We’ll see, rather, that a multiple regression model expresses the joint, linear effects of a set of explanatory variables on an outcome variable.

Review: Regress science achievement scores on math achievement scores.
. use hsb2, clear
Note: recall that these are not randomly sampled data.

. hist science, norm

. hist math, norm

. su science, detail
[detailed summary of science score: percentiles, four smallest & largest values, Obs = 200, Mean = 51.85, Std. Dev., Variance, Skewness, Kurtosis; most numeric values garbled in this transcript]

. su math, d
[detailed summary of math score: percentiles, four smallest & largest values, Obs, Sum of Wgt., Mean, Std. Dev., Variance, Skewness, Kurtosis; numeric values garbled in this transcript]

. scatter science math || qfit science math  Conclusion about approximate linearity & outliers?

. pwcorr science math, obs bonf sig star(.05)
[correlation matrix output: the science–math correlation is starred as significant at the .05 level; numeric values not preserved in this transcript]
 Formula for correlation coefficient?
 Hypothesis test & conclusion?

. reg science math
[regression output as above; numeric values not preserved in the transcript]
 # observations? df? residuals, formula? s² (error variance), formula? s (Root MSE), formula? F, formula? r², formula? y-intercept, CI, formula? slope coefficient, CI, formula? slope hypothesis test?

 Graph the linear prediction for yhat with a confidence interval:. twoway qfitci science math, blc(blue)

 Predictions of yhat using STATA’s calculator, plugging the estimated constant & math coefficient into the regression equation for chosen math scores:
. display <constant> + <math coefficient>*<math score>
. di <constant> + <math coefficient>*<another math score>
[the numeric values typed into these commands were not preserved in this transcript]

 Predictions for yhat using ‘lincom’:
. lincom _cons + math*45
[lincom output: the linear combination ( 1) 45.0 math + _cons, with its estimate, Std. Err., t, P>|t| & 95% Conf. Interval for science; numeric values not preserved in this transcript]
. lincom _cons + math*65
[lincom output: the linear combination ( 1) 65.0 math + _cons, with its estimate, Std. Err., t, P>|t| & 95% Conf. Interval for science; numeric values not preserved in this transcript]