# Lecture 8 Relationships between Scale variables: Regression Analysis

## Presentation on theme: "Lecture 8 Relationships between Scale variables: Regression Analysis"— Presentation transcript:

Lecture 8 Relationships between Scale variables: Regression Analysis
Graduate School Quantitative Research Methods Gwilym Pryce Lecture 8 Relationships between Scale variables: Regression Analysis

Notices: Register

Plan: 1. Linear & Non-linear Relationships 2. Fitting a line using OLS
3. Inference in Regression 4. Ommitted Variables & R2 5. Types of Regression Analysis 6. Properties of OLS Estimates 7. Assumptions of OLS 8. Doing Regression in SPSS

1. Linear & Non-linear relationships between variables
Often of greatest interest in social science is investigation into relationships between variables: is social class related to political perspective? is income related to education? is worker alienation related to job monotony? We are also interested in the direction of causation, but this is more difficult to prove empirically: our empirical models are usually structured assuming a particular theory of causation

Relationships between scale variables
The most straight forward way to investigate evidence for relationship is to look at scatter plots: traditional to: put the dependent variable (I.e. the “effect”) on the vertical axis or “y axis” put the explanatory variable (I.e. the “cause”) on the horizontal axis or “x axis”

Scatter plot of IQ and Income:

We would like to find the line of best fit:

What does the output mean?

Sometimes the relationship appears non-linear:

… and so a straight line of best fit is not always very satisfactory:

Could try a quadratic line of best fit:

But we can simulate a non-linear relationship by first transforming one of the variables:

… or a cubic line of best fit: (overfitted?)

Or could try two linear lines: “structural break”

2. Fitting a line using OLS
The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation: Where: yi = observed value of y = predicted value of yi = the value on the line of best fit corresponding to xi

Regression estimates of a, b: or Ordinary Least Squares (OLS):
This criterion yields estimates of the slope b and y-intercept a of the straight line:

3. Inference in Regression: Hypothesis tests on the slope coefficient:
Regressions are usually run on samples, so what can we say about the population relationship between x and y? Repeated samples would yield a range of values for estimates of b ~ N(b, sb) I.e. b is normally distributed with mean = b = population mean = value of b if regression run on population If there is no relationship in the population between x and y, then b = 0, & this is our H0

What does the standard error mean?

Hypothesis test on b: (1) H0: b = 0 H1: b  0
(I.e. slope coefficient, if regression run on population, would = 0) H1: b  0 (2) a = 0.05 or 0.01 etc. (3) Reject H0 iff P < a (N.B. Rule of thumb: P < 0.05 if tc  2, and P < 0.01 if tc  2.6) (4) Calculate P and conclude.

Example using SPSS output:
(1) H0: no relationship between house price and floor area. H1: there is a relationship (2), (3), (4): P = 1- CDF.T(24.469,554) = Reject H0

4. Ommitted Variables & R2 Q/ is floor area the only factor
4. Ommitted Variables & R2 Q/ is floor area the only factor? How much of the variation in Price does it explain?

R-square R-square tells you how much of the variation in y is explained by the explanatory variable x 0 < R2 < (NB: you want R2 to be near 1). If more than one explanatory variable, use Adjusted R2

Example: 2 explanatory variables

Scatter plot (with floor spikes)

3D Surface Plots: Construction, Price & Unemployment Q = -246 + 27P - 0.2P2 - 73U + 3U2

Construction Equation in a Slump Q = 315 + 4P - 73U + 5U2

5. Types of regression analysis:
Univariate regression: one explanatory variable what we’ve looked at so far in the above equations Multivariate regression: >1 explanatory variable more than one equation on the RHS Log-linear regression & log-log regression: taking logs of variables can deal with certain types of non-linearities & useful properties (e.g. elasticities) Categorical dependent variable regression: dependent variable is dichotomous -- observation has an attribute or not e.g. MPPI take-up, unemployed or not etc.

6. Properties of OLS estimators
OLS estimates of the slope and intercept parameters have been shown to be BLUE (provided certain assumptions are met): Best Linear Unbiased Estimator

“Best” in that they have the minimum variance compared with other estimators (i.e. given repeated samples, the OLS estimates for α and β vary less between samples than any other sample estimates for α and β). “Linear” in that a straight line relationship is assumed. “Unbiased” because, in repeated samples, the mean of all the estimates achieved will tend towards the population values for α and β. “Estimates” in that the true values of α and β cannot be known, and so we are using statistical techniques to arrive at the best possible assessment of their values, given the information available.

7. Assumptions of OLS: For estimation of a and b to be BLUE and for regression inference to be correct: 1. Equation is correctly specified: Linear in parameters (can still transform variables) Contains all relevant variables Contains no irrelevant variables Contains no variables with measurement errors 2. Error Term has zero mean 3. Error Term has constant variance

4. Error Term is not autocorrelated
I.e. correlated with error term from previous time periods 5. Explanatory variables are fixed observe normal distribution of y for repeated fixed values of x 6. No linear relationship between RHS variables I.e. no “multicolinearity”

8. Doing Regression analysis in SPSS
To run regression analysis in SPSS, click on Analyse, Regression, Linear:

Select your dependent (i. e. ‘explained’) variable and independent (i
Select your dependent (i.e. ‘explained’) variable and independent (i.e. ‘explanatory’) variables:

e.g. Floor area and bathrooms: Floor area = a + b Number of bathrooms + e

Confidence Intervals for regression coefficients
Population slope coefficient CI: Rule of thumb:

e.g. regression of floor area on number of bathrooms, CI on slope:
=  7.6 95% CI = ( 57, 72)

Confidence Intervals in SPSS: Analyse, Regression, Linear, click on Statistics and select Confidence intervals:

Our rule of thumb said 95% CI for slope = ( 57, 72)
Our rule of thumb said 95% CI for slope = ( 57, 72). How does this compare?

Past Paper: (C2)Relationships (30%)
Suppose you have a theory that suggests that time watching TV is determined by gregariousness the less gregarious, the more time spent watching TV Use a random sample of 60 observations from the TV watching data to run a statistical test for this relationship that also controls for the effects of age and gender. Carefully interpret the output from this model and discuss the statistical robustness of the results. Suppose you have a theory that suggests that time watching TV is determined by gregariousness (the less gregarious, the more time spent watching TV). Use the random sample of 60 observations from TV watching data to run a statistical test for this relationship that also controls for the effects of age and gender. Carefully interpret the output from this model and discuss the statistical robustness of the results.

Reading: Regression Analysis: *Field, A. chapters on regression.
*Moore and McCabe Chapters on regression. Kennedy, P. ‘A Guide to Econometrics’ Bryman, Alan, and Cramer, Duncan (1999) “Quantitative Data Analysis with SPSS for Windows: A Guide for Social Scientists”, Chapters 9 and 10. Achen, Christopher H. Interpreting and Using Regression (London: Sage, 1982).