Correlation Oh yeah!. Outline Basics Visualization Covariance Significance testing and interval estimation Effect size Bias Factors affecting correlation.

1 Correlation Oh yeah!

2 Outline Basics Visualization Covariance Significance testing and interval estimation Effect size Bias Factors affecting correlation Issues with correlational studies

3 Correlation Research question: What is the relationship between two variables? Correlation is a measure of the direction and degree of linear association between 2 variables. Correlation is the standardized covariance between two variables

4 Questions to be asked… Is there a linear relationship between x and y? What is the strength of this relationship? ▫Pearson Product Moment Correlation Coefficient r Can we describe this relationship and use this to predict y from x? ▫y=bx+a Is the relationship we have described statistically significant? ▫Not a very interesting one if tested against a null of r = 0

5 Other stuff Check scatterplots to see whether a Pearson r makes sense Use both r and R² to understand the situation If data are non-metric or non-normal, use “non-parametric” correlations Correlation does not prove causation ▫True relationship may be in opposite direction, co-causal, or due to other variables However, correlation is the primary statistic used in making an assessment of causality ▫‘Potential’ Causation

6 Possible outcomes -1 to +1 As one variable increases/decreases, the other variable increases/decreases ▫Positive covariance As one variable increases/decreases, another decreases/increases ▫Negative covariance No relationship (independence) ▫r = 0 Non-linear relationship?

7 Scatterplots As we discussed previously, scatterplots provide a pictorial examination of the relationship between two quantitative variables Predictor variable on the X-axis (abscissa); Criterion variable on the Y-axis (ordinate) Each subject is located in the scatterplot by means of a pair of scores (score on the X variable and score on the Y variable) ▫Plot each pair of observations (X, Y) ▪X = predictor variable (independent) ▪Y = criterion variable (dependent) Check for linear relationship ▫‘Line of best fit’ ▫y = a + bx Check for outliers
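
The ‘line of best fit’ y = a + bx mentioned on this slide can be sketched with ordinary least squares. This is only an illustration: the scores below are made up (they are not the data behind the scatterplot slides), and `fit_line` is a hypothetical helper name.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products and sum of squares about the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical scores, invented for illustration only
quant_skill = [10, 12, 15, 18, 20, 23]   # predictor (X)
exam_total = [55, 60, 68, 70, 80, 85]    # criterion (Y)

a, b = fit_line(quant_skill, exam_total)
predicted = a + b * 16   # predicted Y for a new X of 16
print(f"y = {a:.2f} + {b:.2f}x, prediction at x = 16: {predicted:.1f}")
```

Plotting the points first is still the sensible step; the fitted line only summarises the pattern if the scatterplot looks roughly linear.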

8 Example of a Scatterplot The relationship between scores on a test of quantitative skills taken by students on the first day of a stats course (X-axis) and their combined scores on two semester exams (Y-axis)

9 Example of a Scatterplot The two variables are positively related ▫As quantitative skill increases, so does performance on the two midterm exams Linear relationship between the variables ▫Line of best fit drawn on the graph - the ‘regression line’ The ‘strength’ or ‘degree’ of the linear relationship is measured by a correlation coefficient, i.e. how tightly the data points cluster around the regression line We can use this information to determine whether the linear relationship represents a true relationship in the population or is due entirely to chance factors

10 What do we look for in a Scatterplot? Overall pattern: Ellipse ▫Any striking deviations (outliers) Form: is it linear? (curved? clustered?) Direction: is it positive… ▪(high values of the two variables tend to occur together) ▫Or negative ▪(high values of one variable tend to occur with low values of the other variable)? Strength: how close the points lie to the line of best fit (if a linear relationship)

11 r = 1

12 r = 0.95

13 r = 0.7

14 r = 0.4

15 r = -0.4

16 r = -0.7

17 r = -0.95

18 r = -1

19 Linear Correlation / Covariance How do we obtain a quantitative measure of the linear association between X and Y? The Pearson Product-Moment Correlation Coefficient, r, comes from the covariance statistic; it reflects the degree to which the two variables vary together

20 Covariance The variance shared by two variables When X and Y move in the same direction (i.e. their deviations from the mean are similarly pos or neg) ▫cov (x,y) = pos. When X and Y move in opposite directions ▫cov (x,y) = neg. When no constant relationship ▫cov (x,y) = 0

21 Covariance Covariance is not easily interpreted on its own and cannot be compared across different scales of measurement Solution: standardize this measure Pearson’s r: r = cov(x, y) / (s_x · s_y), where s_x and s_y are the standard deviations of x and y
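
A small sketch of why covariance is scale-dependent while r is not. The height and weight values are hypothetical, and the helper names are illustrative rather than taken from the slides.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Covariance standardized by the two standard deviations."""
    # covariance(v, v) is the sample variance of v
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements, invented for illustration only
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 75, 88]
height_m = [h / 100 for h in height_cm]   # same heights, different unit

cov_cm = covariance(height_cm, weight_kg)
cov_m = covariance(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)
r_m = pearson_r(height_m, weight_kg)

print(f"cov (cm): {cov_cm:.2f}   cov (m): {cov_m:.4f}")   # differ by a factor of 100
print(f"r (cm):   {r_cm:.4f}   r (m):   {r_m:.4f}")       # identical
```

Changing the unit of height rescales the covariance but leaves r untouched, which is the point of standardizing.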

22 Significance test for correlation All correlations in a practical setting will be non-zero A significance test can be conducted in an effort to infer to a population Key Question: “Is the r large enough that it is unlikely to have come from a population in which the two variables are unrelated?” Testing the null hypothesis that ▫H0: ρ = 0 vs. alternative hypothesis H1: ρ ≠ 0 ▪ρ = population product-moment correlation coefficient

23 Significance test for correlation However, with larger N, small, possibly non-meaningful, correlations can be deemed ‘significant’ So the better question is: Is a test against zero useful? Tests of significance for r typically have limited utility if testing against a zero value Go by the size and judge worth by what is seen in the relevant literature
Critical values of r (α = .05)
df (N - 2)    critical r
5             .67
10            .50
15            .41
20            .36
25            .32
30            .30
50            .23
200           .11
500           .07
1000          .05

24 Significance test for correlation Furthermore, the approaches outlined in Howell, while standard, are really not necessary Using the t-distribution as described (df = N - 2), we would only really be able to test a null hypothesis of zero If we want to test against some specific value, we have to convert r in some odd fashion and test using these new values ▫Fisher transformation
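
The slide's formulas did not survive in the transcript, so the following is a sketch using the standard textbook forms, which are assumed to match what was shown: t = r·sqrt(N - 2) / sqrt(1 - r²) for a null of zero, and Fisher's r′ = atanh(r) with standard error 1/sqrt(N - 3) for a null of some other value. Function names and the example numbers are illustrative.

```python
from math import atanh, sqrt
from statistics import NormalDist

def t_statistic(r, n):
    """t for H0: rho = 0; refer it to a t distribution with n - 2 df."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def fisher_z_test(r, n, rho0):
    """Two-sided z test of H0: rho = rho0 via Fisher's r-to-z transformation."""
    z_r, z_0 = atanh(r), atanh(rho0)   # transformed sample and null values
    se = 1 / sqrt(n - 3)               # standard error of the transformed r
    z = (z_r - z_0) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# The same small correlation at two sample sizes (illustrative numbers)
t_small_n = t_statistic(0.10, 30)
t_large_n = t_statistic(0.10, 1000)

# Testing an observed r = .50 (N = 103) against a specific value of .30
z, p = fisher_z_test(0.50, 103, 0.30)

print(f"t at N = 30: {t_small_n:.2f}, t at N = 1000: {t_large_n:.2f}")
print(f"Fisher z = {z:.2f}, two-sided p = {p:.3f}")
```

The two t values show the earlier point about large N: the same r = .10 goes from unremarkable to ‘significant’ purely because the sample grew.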

25 Test of the difference between two rs While those new values create an r′ that approximates a normal distribution, why do we have to do it? The reason for this transformation is that since r has limits of ±1, the larger the absolute value of r, the more skewed its sampling distribution about the population ρ (rho)

26 Sampling distribution of a correlation Via the bootstrap, we can see for ourselves that the sampling distribution becomes more and more skewed as we deviate from a null value of zero
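
The skew can be checked with a short simulation. Note the swap: instead of bootstrapping one observed sample as the slide does, this sketch draws repeated samples from a known bivariate normal population, which shows the same effect with fewer moving parts. The sample size, number of repetitions, and seed are arbitrary choices.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulate_rs(rho, n, reps, rng):
    """Draw `reps` samples of size n from a population with correlation rho."""
    rs = []
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # y shares a fraction rho of x; the rest is independent noise
        ys = [rho * x + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
        rs.append(pearson_r(xs, ys))
    return rs

rng = random.Random(1)
skew_null = skewness(simulate_rs(0.0, 20, 2000, rng))
skew_high = skewness(simulate_rs(0.8, 20, 2000, rng))
print(f"skew of r when rho = 0:   {skew_null:.2f}")
print(f"skew of r when rho = 0.8: {skew_high:.2f}")
```

With rho = 0 the distribution of r is roughly symmetric; with rho = 0.8 it piles up against the upper limit and trails off to the left.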

27 The better approach Nowadays, we can bootstrap the r or difference between two rs and do hypothesis tests without unnecessary (and most likely problematic) transformations and assumptions ▫Even for small samples of about 30 it performs as well as the transformation in ideal situations (Efron, 1988) Furthermore, it can be applied to other correlation metrics.
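
A minimal sketch of a percentile bootstrap interval for r, one of several bootstrap variants and not necessarily the one the slides had in mind. The data are simulated here only so the example runs; `bootstrap_ci` and its defaults are illustrative.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def bootstrap_ci(xs, ys, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(xs)
    boot = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]   # sample cases with replacement
        boot.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
    boot.sort()
    lo = boot[int((alpha / 2) * reps)]
    hi = boot[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Simulated sample of 30 cases, standing in for real data
rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(30)]
ys = [0.6 * x + 0.8 * rng.gauss(0, 1) for x in xs]

r_obs = pearson_r(xs, ys)
lo, hi = bootstrap_ci(xs, ys)
print(f"r = {r_obs:.2f}, 95% percentile interval: [{lo:.2f}, {hi:.2f}]")
```

A null value (zero or anything else) can then be judged by whether it falls inside the interval, and swapping `pearson_r` for another correlation metric requires no other change.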

28 Correlation Typically though, for a single sample, correlations among the variables should be considered descriptive statistics, and often the correlation matrix is the data set that forms the basis of an analysis A correlation can also be thought of as an effect size in and of itself ▫Standardized measure of amount of covariation ▫The strength and degree of a linear relationship between variables ▫The amount some variable moves in standard deviation units with a 1 standard deviation change in another variable R² is also an effect size ▫Amount of variability seen in y that can be explained by the variability seen in x ▫Amount of variance they share
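
The ‘standard deviation units’ reading of r can be verified directly: after converting both variables to z-scores, the least-squares slope equals r. The data below are hypothetical and the helper name is illustrative.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert raw scores to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data, invented for illustration only
hours = [2, 4, 5, 7, 8, 10]
score = [50, 58, 61, 70, 72, 85]

zx, zy = standardize(hours), standardize(score)

# r is the average cross-product of z-scores (dividing by n - 1)
r = sum(a * b for a, b in zip(zx, zy)) / (len(zx) - 1)

# Least-squares slope when predicting zy from zx
slope_z = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

r_squared = r ** 2   # proportion of variance in y shared with x
print(f"r = {r:.3f}, standardized slope = {slope_z:.3f}, R^2 = {r_squared:.3f}")
```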

29 Biased estimate: adjusted r r turns out to be upwardly biased, and the smaller the sample size, the greater the bias ▫With large samples the difference will be negligible With smaller samples one should report adjusted r or R²
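
The slide does not give the adjustment itself. A commonly used form, assumed here, is the usual adjusted R² with one predictor: adjusted R² = 1 - (1 - r²)(N - 1)/(N - 2), with adjusted r as its square root. The function name is illustrative.

```python
from math import copysign, sqrt

def adjusted_r(r, n):
    """Adjusted r from the one-predictor adjusted R^2 (assumed formula)."""
    adj_r2 = 1 - (1 - r ** 2) * (n - 1) / (n - 2)
    # The adjustment can go below zero for small r and small n; floor it at 0
    return copysign(sqrt(max(adj_r2, 0.0)), r)

for n in (10, 30, 100, 1000):
    print(f"N = {n:4d}: r = 0.50 -> adjusted r = {adjusted_r(0.50, n):.3f}")
```

The shrinkage is noticeable at N = 10 and all but disappears by N = 1000, in line with the slide's remark about large samples.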

30 Factors affecting correlation Linearity Heterogeneous subsamples Range restrictions Outliers

31 Linearity Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship
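
A deliberately extreme illustration: y is completely determined by x, yet the linear correlation is zero because the relationship is a symmetric curve. The values are constructed for the purpose.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]        # perfect, but curved, relationship

r_curved = pearson_r(xs, ys)
print(f"r for y = x^2 on a symmetric range: {r_curved:.3f}")
```

A scatterplot would reveal the U-shape immediately, which is why the plot should come before the coefficient.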

32 Heterogeneous subsamples Sub-samples may artificially increase or decrease overall r, or in a corollary to Simpson’s paradox, produce opposite sign relations for the aggregated data compared to the groups Solution - calculate r separately for sub-samples & overall, look for differences

33 Heterogeneous subsamples
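
A constructed example of the sign reversal: within each group the relationship is positive, but pooling the groups yields a negative r. The numbers are invented to make the effect obvious.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Group A: low x, high y. Group B: high x, low y. Both trend upward internally.
group_a_x, group_a_y = [1, 2, 3, 4, 5], [11, 13, 12, 14, 16]
group_b_x, group_b_y = [6, 7, 8, 9, 10], [1, 3, 2, 4, 6]

r_a = pearson_r(group_a_x, group_a_y)
r_b = pearson_r(group_b_x, group_b_y)
r_all = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"group A: {r_a:.2f}, group B: {r_b:.2f}, pooled: {r_all:.2f}")
```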

34 Range restriction Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r. Common example occurs with Likert scales ▫E.g. 1 - 4 vs. 1 - 9 However, it is also the case that restricting the range can actually increase r if, by doing so, highly influential data points would be kept out ▫Wilcox (2001)
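
The attenuation can be seen by simulating correlated data and then keeping only cases from a narrow band of x. The population correlation of .6, the cut-off of ±0.5, and the seed are arbitrary choices for this sketch.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(42)
rho = 0.6
xs = [rng.gauss(0, 1) for _ in range(2000)]
ys = [rho * x + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]

r_full = pearson_r(xs, ys)

# Keep only cases in the middle of the x distribution
kept = [(x, y) for x, y in zip(xs, ys) if abs(x) < 0.5]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"full range: r = {r_full:.2f} (n = {len(xs)})")
print(f"restricted: r = {r_restricted:.2f} (n = {len(kept)})")
```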

35 Effect of Outliers Outliers can artificially increase or decrease r Options ▫Compute r with and without outliers ▫Conduct robustified R! ▪For example, recode outliers as having more conservative scores (winsorize) ▫Transform variables (last resort)
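
The first two options can be sketched on a small made-up data set with one wild case. The `winsorize` helper below is a bare-bones version written for this example (it pulls the k most extreme values at each end in to the nearest remaining value); it is not a reference implementation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def winsorize(values, k=1):
    """Pull the k smallest and k largest values in to their nearest kept neighbours."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

# Ten well-behaved cases plus one outlier at (20, -10); all values are invented
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]
ys = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, -10]

r_with = pearson_r(xs, ys)
r_without = pearson_r(xs[:-1], ys[:-1])
r_wins = pearson_r(winsorize(xs), winsorize(ys))

print(f"with outlier: {r_with:.2f}")
print(f"without outlier: {r_without:.2f}")
print(f"winsorized: {r_wins:.2f}")
```

A single case is enough to flip the sign here. Winsorizing each variable separately softens the outlier but does not remove its influence, since the recoded point is still unusual as a pair.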

36 Advantages of correlational studies Show the amount (strength) of relationship present Can be used to make predictions about the variables studied Often easier to collect correlational data, and interpretation is fairly straightforward.

37 Disadvantages of correlational studies Can’t assume that a cause-effect relationship exists Little or no control (experimental manipulation) of the variables is usually seen Relationships may be accidental or due to a third, unmeasured variable ▫Common causes ▫Spurious correlations and Mediators
