Presentation on theme: "SI0030 Social Research Methods Week 6 Luke Sloan"— Presentation transcript:
1 SI0030 Social Research Methods, Week 6, Luke Sloan
Quantitative Data Analysis II: Correlation and Simple Linear Regression
2 Introduction
- Last Week – Recap
- Correlation
- How To Draw A Line
- Simple Linear Regression
- Summary
3 Last Week - Recap
- Hypotheses
- Probability & Significance (p ≤ 0.05)
- Chi-square test for two categorical variables
- t-test for one categorical and one interval variable
- What about a test for two interval variables?...
4 Correlation I
- Calculates the strength and direction of a linear relationship between two interval variables, e.g. is there a relationship between age and income?
- Measured using the Pearson correlation coefficient (r)
- Data must be normally distributed (check with a histogram)
- If not normally distributed, use Spearman’s Rank Order Correlation (rho); consult Pallant (2005:297)
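The coefficient described on this slide can be computed by hand. The sketch below uses the standard formula for Pearson's r (sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the age and income figures are invented for illustration and do not come from the lecture data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ages and incomes (in thousands), invented for illustration
ages = [23, 31, 38, 45, 52, 60]
incomes = [18, 24, 27, 35, 33, 41]
print(round(pearson_r(ages, incomes), 2))
```

A value close to +1 or -1 means the points lie close to a straight line; the sign gives the direction of the relationship.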
5 Correlation II
- ‘r’ can take any value from +1 to -1
- +/- indicates whether the relationship is positive or negative
- +1 or -1 is a perfect linear relationship, but usually it is not this clear cut
Rule of thumb:
- +/- 0.7 = a strong linear relationship
- +/- 0.5 = a good linear relationship
- +/- 0.3 = a linear relationship
- Below +/- 0.3 = weak linear relationship
- 0 = no linear relationship
Alternatively:
- +/- 0.10 to 0.29 = weak
- +/- 0.30 to 0.49 = medium
- +/- 0.50 to 1.0 = strong
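The first rule of thumb can be written as a small lookup. This is only a sketch: it assumes the quoted values (0.7, 0.5, 0.3) are lower bounds of each band, which the slide does not state explicitly, and the function name `describe_strength` is hypothetical.

```python
def describe_strength(r):
    """Label a correlation coefficient using the first rule of thumb above."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size == 0:
        return "no linear relationship"
    # Assumption: each quoted value is the lower bound of its band
    if size >= 0.7:
        label = "strong"
    elif size >= 0.5:
        label = "good"
    elif size >= 0.3:
        label = ""  # the slide simply calls this "a linear relationship"
    else:
        label = "weak"
    direction = "positive" if r > 0 else "negative"
    parts = [label, direction, "linear relationship"]
    return " ".join(part for part in parts if part)

for r in (0.85, -0.55, 0.43, -0.1, 0):
    print(r, "->", describe_strength(r))
```

Any such cut-offs are conventions rather than tests, so the label should always be reported alongside the value of r itself.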
6 Correlation III
[Scatter plots: positive relationship, no relationship, negative relationship]
Formulate hypotheses and use scatter plots!
7 Correlation IV
H1 = There is a relationship between Age and the number of years a candidate has been a member of a political party
H0 = There is no relationship between Age and the number of years a candidate has been a member of a political party
What do you think?
8 Correlation V
Is this normal? Just to prove a point…
9 Correlation VI
Correlations: What was your age last birthday / Number of years a party member
Pearson Correlation   1      .425**
Sig. (2-tailed)              .000
N                     4481   1874   1936
**. Correlation is significant at the 0.01 level (2-tailed).
- Perfect correlation against itself (obviously!) and number of cases in analysis
- Significance for correlation is problematic (highly dependent on sample size): report the p-value but ignore the level of significance
- Pearson’s Correlation Coefficient is r=0.43: a medium/good positive linear relationship
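The warning about sample size can be illustrated with the usual t statistic for testing whether r differs from zero, t = r * sqrt(n - 2) / sqrt(1 - r^2). This formula is standard but does not appear on the slides. In the sketch, 1874 is one of the N values printed in the output above, the other sample sizes are hypothetical, and the function name is invented.

```python
import math

def t_statistic_for_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing H0: no correlation."""
    if n < 3:
        raise ValueError("need at least 3 cases")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Same r, different sample sizes: only n changes
for n in (10, 100, 1874):
    print(n, round(t_statistic_for_r(0.425, n), 2))
```

The same r produces a much larger t (and so a much smaller p-value) as n grows, which is why the strength of the relationship should be judged from r and not from the significance stars.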
10 Correlation VII
- Don’t forget to reject or retain the null hypothesis and discuss the relationship
- Correlation is not causation!
The relationship between the number of years a candidate has been a member of a party and candidate age was explored using Pearson’s correlation coefficient. Both variables were confirmed to have normal distributions [?] and a scatter plot revealed a linear relationship. There was a medium-strength, positive relationship between the two variables (r=0.43, n=1874, p<0.05)... [go on to explain the relationship in detail]
11 How To Draw A Line I
- Correlation is indicative of a relationship, but it does not allow us to quantify it
- What if we wanted to explain how an increase in age leads to an increase in years of party membership?
- What if we wanted to predict years of party membership based only on age?
- The line of best fit is predictive: it is the regression line!
12 How To Draw A Line II
- The regression line allows us to predict any given value of y when we know x, i.e. if we know the age of a candidate we can predict how long they are likely to have been a member of a political party
- Another (more useful!) example would be years in education and income
- Using a regression line we can predict someone’s income based on the number of years they have been in education
- Assumes a causal relationship: that income is ‘caused’ by years in education
13 How To Draw A Line III
But… we don’t simply look very closely at the line and the axis of the scatter plot, because the regression line can be written as an equation:
y = a + b x
- ‘y’ represents the dependent variable (what we are trying to predict), e.g. income
- ‘a’ represents the intercept (where the regression line crosses the vertical ‘y’ axis), aka the constant
- ‘b’ represents the slope of the line (the association between ‘y’ & ‘x’), e.g. how income changes in relation to education
- ‘x’ represents the independent variable (what we are using to predict ‘y’), e.g. years in education
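The slides do not show how ‘a’ and ‘b’ are found from the data. A minimal sketch, assuming the ordinary least-squares method (slope = sum of cross-products of deviations divided by the sum of squared deviations in x; intercept = mean of y minus slope times mean of x): the function name `fit_line` and the data are hypothetical, and the data are constructed to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when x has no variation")
    b = sxy / sxx            # slope: change in y per one unit of x
    a = mean_y - b * mean_x  # intercept: value of y where x = 0
    return a, b

# Hypothetical data lying exactly on y = 4 + 1.5x (income in thousands)
years_in_education = [8, 10, 12, 14, 16]
income = [16, 19, 22, 25, 28]
a, b = fit_line(years_in_education, income)
print(a, b)
```

With real survey data the points scatter around the line, and the fitted ‘a’ and ‘b’ are the values that make the squared vertical distances from the line as small as possible.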
14 How To Draw A Line IV
[Plot on x and y axes of the lines y = 0 + 2x, y = 0 + 1x and y = 0 + 0.5x]
What about… y = x? y = 1 + 1x?
15 Simple Linear Regression
If we know the slope (b) and the intercept (a), for any given value of ‘x’ we can predict ‘y’
EXAMPLE: predicting income (y) in thousands (£) from years in education (x)
Preconditions:
- Intercept (a) = 4
- Slope (b) = 1.5
- For someone with 10 years of education
Equations:
- y = a + bx
- Or… Income = intercept + (slope * years in education)
- Or… Income = 4 + (1.5 * 10) = 19 (£19,000)
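The worked example on this slide can be reproduced in a few lines. The intercept (4) and slope (1.5) are the slide's own values; the function name `predict` is just an illustrative choice.

```python
def predict(x, intercept, slope):
    """Predicted y for a given x on the line y = intercept + slope * x."""
    return intercept + slope * x

# Slide example: income in thousands (£) from years in education
income = predict(10, intercept=4, slope=1.5)
print(f"Predicted income: £{income * 1000:,.0f}")  # prints "Predicted income: £19,000"
```

Changing the first argument gives the prediction for any other number of years in education under the same line.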
16 Simple Linear Regression II
Assumptions:
- Interval level data
- Linearity between ‘x’ and ‘y’
- Outliers (check scatter plot)
- Sample size = 100+?
R² measure of ‘model fit’:
- Literally the Pearson’s correlation coefficient squared
- R² tells us how much of the variance in the dependent variable is explained by the independent variable, e.g. how much of the variance in income can be explained by age
- Expressed as a percentage (1.0 = 100%, 0.5 = 50% etc)
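The two descriptions of the model fit measure on this slide (r squared, and the share of variance explained) give the same number in simple linear regression. The sketch below checks this on hypothetical data by computing it both ways; the function name, the data, and the least-squares fit used for the second route are illustrative assumptions, not taken from the lecture.

```python
import math

def r_squared_two_ways(xs, ys):
    """Return (r squared, 1 - SS_residual / SS_total) for a simple regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in y

    # Route 1: square Pearson's r
    r = sxy / math.sqrt(sxx * syy)

    # Route 2: fit y = a + b*x by least squares, then compare the
    # unexplained (residual) variation with the total variation
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    return r ** 2, 1 - ss_residual / syy

# Hypothetical ages and incomes (in thousands), invented for illustration
ages = [23, 31, 38, 45, 52, 60]
incomes = [18, 24, 27, 35, 33, 41]
from_r, from_residuals = r_squared_two_ways(ages, incomes)
print(f"{from_r:.3f} {from_residuals:.3f}")
```

The two printed values agree, and multiplying either by 100 gives the percentage of variance explained.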
17 Simple Linear Regression III
H0 = There is no relationship between Age and the number of years a candidate has been a member of a political party
H1 = There is a relationship between Age and the number of years a candidate has been a member of a political party
H2 = As the age of a candidate increases, so will the number of years that they have been a party member
‘Years as Party Member’ = intercept + (slope * ’Age’)
18 Simple Linear Regression IV
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .425a   .181       .180                11.995
a. Predictors: (Constant), What was your age last birthday
- R: Pearson’s correlation coefficient (same value!)
- R Square: 18% of variance in party membership (y) explained by age (x)
ANOVAb
Model 1      Sum of Squares   df     Mean Square   F   Sig.
Regression                                             .000a
Residual                      1872
Total                         1873
a. Predictors: (Constant), What was your age last birthday
b. Dependent Variable: Number of years a party member
- This tests the hypothesis that the model is a better predictor of party membership than if we simply used the mean value of party membership
- p<0.05 so the regression model is a significantly better predictor than the mean value
19 Simple Linear Regression V
Coefficientsa
Model 1                           B (Unstandardized)   Std. Error   Beta (Standardized)   t        Sig.
(Constant)                        -6.899               1.156                              -5.966   .000
What was your age last birthday   .418                 .021         .425                  20.327
a. Dependent Variable: Number of years a party member
- B for (Constant) is the intercept (a)
- B for age is the slope (b)
- p<0.05 so ‘Age’ has a significant effect on ‘Party Membership’
- A one unit increase in age will result in an increase in party membership of 0.42
y = a + b x
Or… ‘Party Membership’ = -6.9 + (0.42 * ’Age’)
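The coefficients in this output can be plugged straight into the regression equation to make predictions. The sketch uses the unrounded B values from the table (-6.899 and .418); the function name and the example ages are hypothetical.

```python
INTERCEPT = -6.899  # B for (Constant) in the Coefficients table
SLOPE = 0.418       # B for "What was your age last birthday"

def predicted_years_as_member(age):
    """Predicted number of years a party member for a candidate of a given age."""
    return INTERCEPT + SLOPE * age

# Example ages chosen for illustration
for age in (20, 40, 60):
    print(age, round(predicted_years_as_member(age), 1))
```

Predictions turn negative below roughly age 16.5, a reminder that the intercept is an extrapolation outside the range of candidate ages and should not be read as a meaningful value in its own right.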
20 Simple Linear Regression VI
- … and this is what we saw in the original scatter plot!
- The ‘regression line’ will intercept the vertical (y) axis at -6.9
- The ‘regression line’ rises by 0.42 on the vertical axis (y) for every one unit increase on the horizontal axis (x)
- The R² value is low because of the fanning effect (remember the histograms!)
21 Summary
- How to describe and quantify the relationship between two interval variables
- Correlation: the strength and direction of the association
- Regression: the causal and quantified effect of an independent on a dependent variable