3 Plan: 1. Linear & Non-linear Relationships 2. Fitting a line using OLS 3. Inference in Regression4. Ommitted Variables & R25. Types of Regression Analysis6. Properties of OLS Estimates7. Assumptions of OLS8. Doing Regression in SPSS
4 1. Linear & Non-linear relationships between variables Often of greatest interest in social science is investigation into relationships between variables:is social class related to political perspective?is income related to education?is worker alienation related to job monotony?We are also interested in the direction of causation, but this is more difficult to prove empirically:our empirical models are usually structured assuming a particular theory of causation
5 Relationships between scale variables The most straight forward way to investigate evidence for relationship is to look at scatter plots:traditional to:put the dependent variable (I.e. the “effect”) on the vertical axisor “y axis”put the explanatory variable (I.e. the “cause”) on the horizontal axisor “x axis”
15 Or could try two linear lines: “structural break”
16 2. Fitting a line using OLS The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation:Where:yi = observed value of y= predicted value of yi= the value on the line of best fit corresponding to xi
17 Regression estimates of a, b: or Ordinary Least Squares (OLS): This criterion yields estimates of the slope b and y-intercept a of the straight line:
18 3. Inference in Regression: Hypothesis tests on the slope coefficient: Regressions are usually run on samples, so what can we say about the population relationship between x and y?Repeated samples would yield a range of values for estimates of b ~ N(b, sb)I.e. b is normally distributed with mean = b = population mean = value of b if regression run on populationIf there is no relationship in the population between x and y, then b = 0, & this is our H0
20 Hypothesis test on b: (1) H0: b = 0 H1: b 0 (I.e. slope coefficient, if regression run on population, would = 0)H1: b 0(2) a = 0.05 or 0.01 etc.(3) Reject H0 iff P < a(N.B. Rule of thumb: P < 0.05 if tc 2, and P < 0.01 if tc 2.6)(4) Calculate P and conclude.
21 Example using SPSS output: (1) H0: no relationship between house price and floor area.H1: there is a relationship(2), (3), (4):P = 1- CDF.T(24.469,554) =Reject H0
22 4. Ommitted Variables & R2 Q/ is floor area the only factor 4. Ommitted Variables & R2 Q/ is floor area the only factor? How much of the variation in Price does it explain?
23 R-squareR-square tells you how much of the variation in y is explained by the explanatory variable x0 < R2 < (NB: you want R2 to be near 1).If more than one explanatory variable, use Adjusted R2
27 Construction Equation in a Slump Q = 315 + 4P - 73U + 5U2
28 5. Types of regression analysis: Univariate regression: one explanatory variablewhat we’ve looked at so far in the above equationsMultivariate regression: >1 explanatory variablemore than one equation on the RHSLog-linear regression & log-log regression:taking logs of variables can deal with certain types of non-linearities & useful properties (e.g. elasticities)Categorical dependent variable regression:dependent variable is dichotomous -- observation has an attribute or note.g. MPPI take-up, unemployed or not etc.
29 6. Properties of OLS estimators OLS estimates of the slope and intercept parameters have been shown to be BLUE (provided certain assumptions are met):BestLinearUnbiasedEstimator
30 “Best” in that they have the minimum variance compared with other estimators (i.e. given repeated samples, the OLS estimates for α and β vary less between samples than any other sample estimates for α and β).“Linear” in that a straight line relationship is assumed.“Unbiased” because, in repeated samples, the mean of all the estimates achieved will tend towards the population values for α and β.“Estimates” in that the true values of α and β cannot be known, and so we are using statistical techniques to arrive at the best possible assessment of their values, given the information available.
31 7. Assumptions of OLS:For estimation of a and b to be BLUE and for regression inference to be correct:1. Equation is correctly specified:Linear in parameters (can still transform variables)Contains all relevant variablesContains no irrelevant variablesContains no variables with measurement errors2. Error Term has zero mean3. Error Term has constant variance
32 4. Error Term is not autocorrelated I.e. correlated with error term from previous time periods5. Explanatory variables are fixedobserve normal distribution of y for repeated fixed values of x6. No linear relationship between RHSvariablesI.e. no “multicolinearity”
33 8. Doing Regression analysis in SPSS To run regression analysis in SPSS, click on Analyse, Regression, Linear:
34 Select your dependent (i. e. ‘explained’) variable and independent (i Select your dependent (i.e. ‘explained’) variable and independent (i.e. ‘explanatory’) variables:
35 e.g. Floor area and bathrooms: Floor area = a + b Number of bathrooms + e
37 Confidence Intervals for regression coefficients Population slope coefficient CI:Rule of thumb:
38 e.g. regression of floor area on number of bathrooms, CI on slope: = 7.695% CI = ( 57, 72)
39 Confidence Intervals in SPSS: Analyse, Regression, Linear, click on Statistics and select Confidence intervals:
40 Our rule of thumb said 95% CI for slope = ( 57, 72) Our rule of thumb said 95% CI for slope = ( 57, 72). How does this compare?
41 Past Paper: (C2)Relationships (30%) Suppose you have a theory that suggests that time watching TV is determined by gregariousnessthe less gregarious, the more time spent watching TVUse a random sample of 60 observations from the TV watching data to run a statistical test for this relationship that also controls for the effects of age and gender.Carefully interpret the output from this model and discuss the statistical robustness of the results.Suppose you have a theory that suggests that time watching TV is determined by gregariousness (the less gregarious, the more time spent watching TV). Use the random sample of 60 observations from TV watching data to run a statistical test for this relationship that also controls for the effects of age and gender. Carefully interpret the output from this model and discuss the statistical robustness of the results.
42 Reading: Regression Analysis: *Field, A. chapters on regression. *Moore and McCabe Chapters on regression.Kennedy, P. ‘A Guide to Econometrics’Bryman, Alan, and Cramer, Duncan (1999) “Quantitative Data Analysis with SPSS for Windows: A Guide for Social Scientists”, Chapters 9 and 10.Achen, Christopher H. Interpreting and Using Regression (London: Sage, 1982).