# Correlation Oh yeah!

Outline
- Basics
- Visualization
- Covariance
- Significance testing and interval estimation
- Effect size
- Bias
- Factors affecting correlation
- Issues with correlational studies

Correlation
- Research question: What is the relationship between two variables?
- Correlation is a measure of the direction and degree of linear association between two variables
- Correlation is the standardized covariance between two variables

Questions to be asked…
- Is there a linear relationship between x and y?
- What is the strength of this relationship? Pearson product-moment correlation coefficient, r
- Can we describe this relationship and use it to predict y from x? y = bx + a
- Is the relationship we have described statistically significant? Not a very interesting question if tested against a null of r = 0

Other stuff
- Check scatterplots to see whether a Pearson r makes sense
- Use both r and R² to understand the situation
- If data are non-metric or non-normal, use “non-parametric” correlations
- Correlation does not prove causation: the true relationship may be in the opposite direction, co-causal, or due to other variables
- However, correlation is the primary statistic used in making an assessment of causality: ‘potential’ causation
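As one illustration of a “non-parametric” correlation, the sketch below computes Spearman's rank correlation as a Pearson r on ranks. The slide does not name a specific coefficient, so the choice of Spearman and the helper names (`pearson_r`, `average_ranks`, `spearman_rho`) are assumptions made for this example.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# A monotonic but strongly curved relationship (made-up data).
x = [1, 2, 3, 4, 5]
y = [math.exp(v) for v in x]
r_pearson = pearson_r(x, y)
r_spearman = spearman_rho(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because the ranks of y follow the ranks of x perfectly, the rank-based coefficient reaches 1 while Pearson's r stays below it.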

Possible outcomes
- r ranges from -1 to +1
- As one variable increases/decreases, the other variable increases/decreases: positive covariance
- As one variable increases/decreases, the other decreases/increases: negative covariance
- No relationship (independence): r = 0
- Non-linear relationship?

Scatterplots
- As we discussed previously, scatterplots provide a pictorial examination of the relationship between two quantitative variables
- Predictor variable on the X-axis (abscissa); criterion variable on the Y-axis (ordinate)
- Each subject is located in the scatterplot by means of a pair of scores (score on the X variable and score on the Y variable)
- Plot each pair of observations (X, Y)
  - X = predictor variable (independent)
  - Y = criterion variable (dependent)
- Check for linear relationship: ‘line of best fit’ y = a + bx
- Check for outliers
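The ‘line of best fit’ y = a + bx can be sketched with ordinary least squares, where b is the sum of cross-products divided by the sum of squares of x and a is chosen so the line passes through the two means. The data and the function name `least_squares_line` are made up for illustration.

```python
def least_squares_line(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx        # slope
    a = my - b * mx      # intercept: the line passes through (mean x, mean y)
    return a, b

# Hypothetical scores for five subjects.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares_line(x, y)
print(f"y = {a:.1f} + {b:.1f}x")
```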

Example of a Scatterplot
The relationship between scores on a test of quantitative skills taken by students on the first day of a stats course (X-axis) and their combined scores on two semester exams (Y-axis)

Example of a Scatterplot
- The two variables are positively related: as quantitative skill increases, so does performance on the two midterm exams
- Linear relationship between the variables
- Line of best fit drawn on the graph: the ‘regression line’
- The ‘strength’ or ‘degree’ of the linear relationship is measured by a correlation coefficient, i.e. how tightly the data points cluster around the regression line
- We can use this information to determine whether the linear relationship represents a true relationship in the population or is due entirely to chance factors

What do we look for in a Scatterplot?
- Overall pattern: ellipse
- Any striking deviations (outliers)
- Form: is it linear? (curved? clustered?)
- Direction: is it positive (high values of the two variables tend to occur together) or negative (high values of one variable tend to occur with low values of the other variable)?
- Strength: how close the points lie to the line of best fit (if a linear relationship)

[Figures: example scatterplots for r = 1, 0.95, 0.7, 0.4, -0.4, -0.7, -0.95, and -1]
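Data with roughly these correlations can be simulated to see how tightly points cluster at each value of r. This is a minimal sketch that assumes standard normal variables and builds y as a weighted mix of x and independent noise; the function names, the seed, and the sample size are illustrative choices, not part of the original slides.

```python
import math
import random

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_pairs(rho, n, rng):
    """Draw n (x, y) pairs whose population correlation is rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        xs.append(x)
        # Weights chosen so that var(y) = 1 and corr(x, y) = rho.
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * noise)
    return xs, ys

rng = random.Random(1)
sample_rs = {}
for rho in (0.95, 0.7, 0.4, -0.4, -0.7, -0.95):
    xs, ys = simulate_pairs(rho, 5000, rng)
    sample_rs[rho] = pearson_r(xs, ys)
    print(f"target {rho:+.2f}  sample r {sample_rs[rho]:+.3f}")
```

Passing the simulated `xs` and `ys` to any plotting tool gives scatterplots comparable to the panels above.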

Linear Correlation / Covariance
- How do we obtain a quantitative measure of the linear association between X and Y?
- The Pearson product-moment correlation coefficient, r, comes from the covariance statistic; it reflects the degree to which the two variables vary together

Covariance
- The variance shared by two variables
- When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative), cov(x, y) is positive
- When X and Y move in opposite directions, cov(x, y) is negative
- When there is no consistent relationship, cov(x, y) = 0

Covariance
- Covariance is not easily interpreted on its own and cannot be compared across different scales of measurement
- Solution: standardize this measure
- Pearson’s r: r = cov(x, y) / (s_x s_y), where s_x and s_y are the standard deviations of x and y
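A small sketch of why standardizing helps: rescaling x (say, from metres to centimetres) multiplies the covariance by 100 but leaves r unchanged. The data are made up, and the sample (n - 1) versions of the covariance and standard deviation are assumed.

```python
import math

def covariance(x, y):
    """Sample covariance: mean cross-product of deviations, using n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def std_dev(x):
    """Sample standard deviation (the square root of cov(x, x))."""
    return math.sqrt(covariance(x, x))

def pearson_r(x, y):
    """Covariance standardized by the two standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * v for v in x]  # same variable on a different scale

cov_xy = covariance(x, y)
cov_rescaled = covariance(x_rescaled, y)
r_xy = pearson_r(x, y)
r_rescaled = pearson_r(x_rescaled, y)
print(cov_xy, cov_rescaled)                  # covariance depends on the scale
print(round(r_xy, 4), round(r_rescaled, 4))  # r does not
```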

Significance test for correlation
- All correlations in a practical setting will be non-zero
- A significance test can be conducted in an effort to infer to a population
- Key question: “Is the r large enough that it is unlikely to have come from a population in which the two variables are unrelated?”
- Testing the null hypothesis H0: ρ = 0 vs. the alternative hypothesis H1: ρ ≠ 0, where ρ is the population product-moment correlation coefficient

Significance test for correlation
- However, with larger N, small, possibly non-meaningful, correlations can be deemed ‘significant’
- So the better question is: is a test against zero useful?
- Tests of significance for r typically have limited utility if testing against a zero value
- Go by the size¹ and judge worth by what is seen in the relevant literature
- [Table: critical values of r, df = N - 2, α = .05]

1. It’s not too hard to remember that with ~50 cases you only need ~.20 to ‘flag’ a correlation as statistically significant with the typical tailed approach, and in many situations that’s probably not considered a practical effect. Asterisks aren’t really necessary, are ugly, and misleadingly flag small correlations that are probably unimportant. Flagging isn’t necessarily a bad approach; it’s just used poorly and often incorrectly (in the case of identification of multiple alpha levels).
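The dependence on N can be sketched with the usual t statistic for a correlation, t = r·sqrt(N - 2) / sqrt(1 - r²) with df = N - 2. The values of r and N below are arbitrary illustrations. For large df the two-tailed .05 critical t is close to 1.96, so the same tiny correlation moves from clearly non-significant to clearly ‘significant’ as N grows.

```python
import math

def t_for_r(r, n):
    """t statistic for testing a sample correlation against zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.05  # a correlation most fields would call negligible
t_small_n = t_for_r(r, 50)
t_large_n = t_for_r(r, 10_000)
print(f"N = 50:     t = {t_small_n:.2f}")
print(f"N = 10000:  t = {t_large_n:.2f}")
```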

Significance test for correlation
- Furthermore, the approaches outlined in Howell, while standard, are really not necessary
- Using the t-distribution as described (df = N - 2), we would only really be able to test a null hypothesis of zero
- If we want to test against some specific value¹, we have to convert r in some odd fashion (the Fisher transformation) and test using these new values

1. Regarding testing against a non-nil r, Howell states “You probably can’t think of many situations in which you would like to do that, and neither can I.” This is definitely one place I would disagree with Howell. I typically can’t think of any reason to test against a nil value except in purely exploratory areas of research.
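For reference, the transformation-based test against a specific value can be sketched as follows. It uses the standard Fisher transformation r′ = atanh(r), whose sampling distribution is approximately normal with standard error 1/sqrt(N - 3). The function name and the example numbers (r = .50, ρ0 = .30, N = 103) are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, rho0, n):
    """Two-sided test of H0: rho = rho0 using the Fisher transformation."""
    r_prime = math.atanh(r)        # transformed sample correlation
    rho0_prime = math.atanh(rho0)  # transformed null value
    z = (r_prime - rho0_prime) * math.sqrt(n - 3)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = fisher_z_test(r=0.50, rho0=0.30, n=103)
print(f"z = {z:.3f}, p = {p:.4f}")
```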

Test of the difference between two rs
- While those new values create an r′ that approximates a normal distribution, why do we have to do it?
- The reason for this transformation is that, since r has limits of ±1, the larger the absolute value of r, the more skewed its sampling distribution about the population ρ (rho)

Sampling distribution of a correlation
- Via the bootstrap, we can see for ourselves that the sampling distribution becomes more and more skewed as we deviate from a null value of zero
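The skew can be checked with a short simulation. Note that this sketch uses Monte Carlo draws from a known normal population rather than the bootstrap mentioned on the slide, which makes the population ρ explicit. The sample size, number of replications, and seed are arbitrary choices.

```python
import math
import random

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def sampling_distribution(rho, n, reps, rng):
    """Sample r from `reps` samples of size n drawn with population correlation rho."""
    rs = []
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
        rs.append(pearson_r(xs, ys))
    return rs

rng = random.Random(42)
skew_null = skewness(sampling_distribution(0.0, 20, 2000, rng))
skew_high = skewness(sampling_distribution(0.9, 20, 2000, rng))
print(f"skew at rho = 0.0: {skew_null:+.2f}")
print(f"skew at rho = 0.9: {skew_high:+.2f}")
```

With ρ = 0 the distribution of r is roughly symmetric; with ρ = .9 it has a long tail toward zero because r cannot exceed 1.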

The better approach
- Nowadays, we can bootstrap the r or the difference between two rs and do hypothesis tests without unnecessary (and most likely problematic) transformations and assumptions
- Even for small samples of about 30, it performs as well as the transformation in ideal situations (Efron, 1988)
- Furthermore, it can be applied to other correlation metrics
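A minimal sketch of bootstrapping r: resample subjects (x, y pairs) with replacement, recompute r each time, and read an interval off the percentiles. The percentile interval is only one of several bootstrap intervals and is an assumption here, as are the simulated data, the function names, and the number of resamples.

```python
import math
import random

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_r_ci(x, y, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    boot = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample subjects
        boot.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    boot.sort()
    lower = boot[int((alpha / 2) * reps)]
    upper = boot[int((1 - alpha / 2) * reps) - 1]
    return lower, upper

# Simulated sample of 30 cases (population correlation 0.6).
data_rng = random.Random(3)
x = [data_rng.gauss(0, 1) for _ in range(30)]
y = [0.6 * xi + 0.8 * data_rng.gauss(0, 1) for xi in x]

r_obs = pearson_r(x, y)
ci_low, ci_high = bootstrap_r_ci(x, y)
print(f"r = {r_obs:.3f}, 95% percentile interval = ({ci_low:.3f}, {ci_high:.3f})")
```

A null value, zero or otherwise, can then be judged by whether it falls inside the interval, and swapping `pearson_r` for another correlation metric requires no other change.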

Correlation
- Typically though, for a single sample, correlations among the variables should be considered descriptive statistics¹, and often the correlation matrix is the data set that forms the basis of an analysis
- A correlation can also be thought of as an effect size in and of itself
  - Standardized measure of amount of covariation
  - The strength and degree of a linear relationship between variables
  - The amount some variable moves in standard deviation units with a 1 standard deviation change in another variable
- R² is also an effect size
  - Amount of variability seen in y that can be explained by the variability seen in x
  - Amount of variance they share²

1. Hence no need for flagging since an inferential analysis is not performed.

2. Interesting example: For IQ it is said that 80% is accounted for by heredity, 20% by environment. 80% and 20% are the variance accounted for in IQ by each, i.e. R². So heredity is 4 times as important as environment for IQ, right? If you look at the correlations (and we can for the most part assume independence between genetics and environment), r(heredity, IQ) = sqrt(.8) = .894 and r(environment, IQ) = sqrt(.2) = .447. So in terms of how much we expect IQ to move as a result of a 1 standard deviation unit change in the predictor, the difference is only ~2:1 in favor of heredity (.894 to .447).
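The arithmetic in the IQ example is easy to verify. The variable names below are illustrative, and the 80%/20% split is taken from the text as given.

```python
import math

r2_heredity = 0.80     # proportion of IQ variance attributed to heredity
r2_environment = 0.20  # proportion attributed to environment

r_heredity = math.sqrt(r2_heredity)
r_environment = math.sqrt(r2_environment)

ratio_r2 = r2_heredity / r2_environment  # comparison on the variance scale
ratio_r = r_heredity / r_environment     # comparison on the sd-unit scale

print(f"r(heredity, IQ)    = {r_heredity:.3f}")
print(f"r(environment, IQ) = {r_environment:.3f}")
print(f"ratio of R² values = {ratio_r2:.1f}, ratio of r values = {ratio_r:.1f}")
```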

Bias
- r turns out to be upwardly biased, and the smaller the sample size, the greater the bias
- With large samples the difference will be negligible
- With smaller samples one should report adjusted r or R²
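One common adjustment for a single predictor is r_adj = sqrt(1 - (1 - r²)(N - 1)/(N - 2)). The slide does not say which adjustment it has in mind, so treating this as the intended formula is an assumption. The sketch shows the correction shrinking toward nothing as N grows.

```python
import math

def adjusted_r(r, n):
    """Adjusted correlation for one predictor (assumed formula, see lead-in)."""
    adj_r2 = 1 - (1 - r ** 2) * (n - 1) / (n - 2)
    # The adjusted R² can fall below zero for small r and n; floor it at 0.
    return math.copysign(math.sqrt(max(adj_r2, 0.0)), r)

adj_small = adjusted_r(0.5, 20)
adj_large = adjusted_r(0.5, 2000)
print(f"r = .50, N = 20:   adjusted r = {adj_small:.3f}")
print(f"r = .50, N = 2000: adjusted r = {adj_large:.4f}")
```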

Factors affecting correlation
- Linearity
- Heterogeneous subsamples
- Range restrictions
- Outliers

Linearity
- Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship
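An extreme case, sketched with made-up data: y is completely determined by x through y = x², yet Pearson's r is zero because the relationship has no linear component over this symmetric range.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # perfect, but purely curved, dependence on x

r_curved = pearson_r(x, y)
print(r_curved)
```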

Heterogeneous subsamples
- Sub-samples may artificially increase or decrease overall r or, in a corollary to Simpson’s paradox, produce opposite-sign relations for the aggregated data compared to the groups
- Solution: calculate r separately for sub-samples and overall, and look for differences
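A toy illustration of the sign reversal, using two made-up groups: within each group the relationship is perfectly negative, but pooling them yields a strong positive r because the groups differ in their means on both variables.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sub-samples: group B scores higher than group A on both variables.
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [6, 7, 8], [8, 7, 6]

r_a = pearson_r(group_a_x, group_a_y)
r_b = pearson_r(group_b_x, group_b_y)
r_all = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)
print(f"group A: {r_a:+.2f}  group B: {r_b:+.2f}  pooled: {r_all:+.2f}")
```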

Range restriction
- Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r
- A common example occurs with Likert scales
- However, it is also the case that restricting the range can actually increase r if, by doing so, highly influential data points are kept out (Wilcox, 2001)
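The attenuation can be sketched by simulating a large sample and then keeping only cases with high x values, much as a selective admissions cutoff would. The population correlation of .6, the cutoff at x > 1, and the seed are arbitrary assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)
rho = 0.6
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]

r_full = pearson_r(xs, ys)

# Keep only cases more than 1 sd above the mean on x (restricted range).
kept = [(x, y) for x, y in zip(xs, ys) if x > 1.0]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(f"full range:       r = {r_full:.3f} (n = {len(xs)})")
print(f"restricted range: r = {r_restricted:.3f} (n = {len(kept)})")
```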

Effect of Outliers
- Outliers can artificially increase or decrease r
- Options
  - Compute r with and without outliers
  - Compute a robustified r, for example by recoding outliers as having more conservative scores (winsorize)
  - Transform variables (last resort)
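The first two options can be sketched on made-up data with a single wild case. The `winsorize` helper here is a simple assumption: it pulls the k smallest and k largest scores in to the nearest retained value. Other recoding rules are possible.

```python
import math

def pearson_r(x, y):
    """Pearson r from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def winsorize(values, k=1):
    """Clamp the k smallest and k largest values to the nearest retained value."""
    s = sorted(values)
    low, high = s[k], s[-k - 1]
    return [min(max(v, low), high) for v in values]

# Ten well-behaved cases plus one outlier at (30, 0).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 0]

r_with = pearson_r(x, y)
r_without = pearson_r(x[:-1], y[:-1])
r_winsorized = pearson_r(winsorize(x), winsorize(y))

print(f"with outlier:    {r_with:+.3f}")
print(f"without outlier: {r_without:+.3f}")
print(f"winsorized:      {r_winsorized:+.3f}")
```

Here one case is enough to flip the sign of r. Winsorizing recovers a positive, if more conservative, estimate without discarding the case.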