Correlation Review and Extension. Questions to be asked… Is there a linear relationship between x and y? What is the strength of this relationship? Pearson.


1 Correlation Review and Extension

2 Questions to be asked… Is there a linear relationship between x and y? What is the strength of this relationship? Pearson product-moment correlation coefficient (r). Can we describe this relationship and use it to predict y from x? y = bx + a. Is the relationship we have described statistically significant? Not a very interesting question if tested against a null of r = 0.
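A minimal sketch of how r ties into the prediction line y = bx + a, assuming an ordinary least-squares fit, where b = r(s_y / s_x) and a = mean(y) - b·mean(x). The data values and the `fit_line` helper are made up for illustration.

```python
from math import sqrt

def fit_line(x, y):
    """Return (r, b, a) for the least-squares line y = b*x + a."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    r = sxy / sqrt(sxx * syy)
    b = r * sqrt(syy / sxx)  # slope = r * (s_y / s_x)
    a = my - b * mx          # the line passes through (mean x, mean y)
    return r, b, a

# Illustrative values only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r, b, a = fit_line(x, y)
predicted_at_6 = b * 6 + a
print(r, b, a, predicted_at_6)
```

Note that the slope is just r rescaled by the two standard deviations, so r = 0 forces b = 0.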

3 Other stuff: Check scatterplots to see whether a Pearson r makes sense. Use both r and r² to understand the situation. If data are non-metric or non-normal, use “non-parametric” correlations. Correlation does not prove causation: the true relationship may be in the opposite direction, co-causal, or due to other variables. However, correlation is the primary statistic used in making an assessment of causality (‘potential’ causation).

4 Possible outcomes: r ranges from -1 to +1. As one variable increases/decreases, the other variable increases/decreases: positive covariance. As one variable increases/decreases, the other decreases/increases: negative covariance. No relationship (independence): r = 0. Non-linear relationship: r may be near 0 even though the variables are related.

5 Covariance: The variance shared by two variables. When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative), cov(x, y) is positive. When X and Y move in opposite directions, cov(x, y) is negative. When there is no consistent relationship, cov(x, y) = 0.

6 Covariance is not very meaningful on its own and cannot be compared across different scales of measurement. Solution: standardize this measure. Pearson’s r: r = cov(x, y) / (s_x s_y), where s_x and s_y are the standard deviations of x and y.
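A small sketch of why covariance needs standardizing: changing the units of one variable rescales the covariance but leaves r untouched. The height and weight values are invented for illustration, and the helper names are not from the slides.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance with the usual n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Illustrative values only
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weight_kg = [52, 60, 63, 75, 80]
height_cm = [h * 100 for h in height_m]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)   # 100 times larger
r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)      # same as r_m
print(cov_m, cov_cm, r_m, r_cm)
```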

7

8 Factors affecting Pearson r: linearity, heterogeneous subsamples, range restrictions, outliers.

9 Linearity: Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship.

10 Heterogeneous subsamples: Sub-samples may artificially increase or decrease the overall r. Solution: calculate r separately for the sub-samples and overall, and look for differences.
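As a rough illustration of the sub-sample problem, the toy data below (invented for this sketch) have no linear relationship within either group, yet pooling the groups produces a large overall r simply because the groups differ in their means.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Two hypothetical sub-samples
xa, ya = [1, 2, 3], [2, 1, 2]
xb, yb = [7, 8, 9], [8, 7, 8]

r_a = pearson_r(xa, ya)              # zero within group A
r_b = pearson_r(xb, yb)              # zero within group B
r_all = pearson_r(xa + xb, ya + yb)  # large when pooled
print(r_a, r_b, r_all)
```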

11 Range restriction: Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r. A common example occurs with Likert scales, e.g. 1 - 4 vs. 1 - 9. However, restricting the range can actually increase r if doing so keeps highly influential data points out.
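A sketch of the attenuation effect, using made-up values: the same noisy linear pattern yields a noticeably smaller r once only the middle of the x range is kept. The cut-offs (4 to 6) are arbitrary choices for the example.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Illustrative data: y = x plus an alternating +1 / -1 disturbance
x = list(range(1, 10))
y = [2, 1, 4, 3, 6, 5, 8, 7, 10]

r_full = pearson_r(x, y)

# Keep only cases with x between 4 and 6 (a restricted range)
kept = [(a, b) for a, b in zip(x, y) if 4 <= a <= 6]
r_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])
print(r_full, r_restricted)
```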

12 Effect of Outliers: Outliers can artificially and dramatically increase or decrease r. Options: Compute r with and without outliers. Compute a robustified r, for example by recoding outliers as having more conservative scores (Winsorizing). Transform variables.
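The first option, computing r with and without the suspect point, might look like the sketch below. The data are invented: six points with a weak negative trend plus one extreme case at (20, 20).

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Illustrative values only
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 2, 3, 1, 2]

r_without = pearson_r(x, y)            # weak, slightly negative
r_with = pearson_r(x + [20], y + [20]) # one point drives r close to 1
print(r_without, r_with)
```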

13 What else? r is the starting point for any regression and related method. Both the slope and the magnitude of the residuals are reflective of r: if r = 0, the slope = 0. As such, a lone r doesn’t really provide much more than a starting point for understanding the relationship between two variables.

14 Robust Approaches to Correlation: rank approaches, Winsorized, percentage bend.

15 Rank approaches: Spearman’s rho and Kendall’s tau. Spearman’s rho is calculated using the same formula as Pearson’s r, but with the variables in the form of ranks. Simply rank the data available: X = 10 15 5 35 25 becomes X = 2 3 1 5 4. Do this for X and Y and calculate r as normal. Kendall’s tau is another rank-based approach, but the details of its calculation are different. For theoretical reasons it may be preferable to Spearman’s, but both should be consistent for the most part and perform better than Pearson’s r when dealing with non-normal data.
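A minimal sketch of Spearman's rho as "rank, then compute r as normal". The X values are the ones from the slide; the Y values (the cube of X) are an assumption added here to show a monotone but non-linear relationship, and tied values are given average ranks.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    return pearson_r(ranks(x), ranks(y))

x = [10, 15, 5, 35, 25]       # from the slide
y = [v ** 3 for v in x]       # hypothetical monotone, non-linear outcome

x_ranks = ranks(x)            # 2, 3, 1, 5, 4
rho = spearman_rho(x, y)      # perfect monotone agreement
r_raw = pearson_r(x, y)       # lower, because the relation is not linear
print(x_ranks, rho, r_raw)
```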

16 Winsorized Correlation: As mentioned before, Winsorizing data involves changing a chosen percentage of the most extreme scores (high and low) to the value of the most extreme score that is not Winsorized: X = 1 2 3 4 5 6 becomes X = 2 2 3 4 5 5. Winsorize the X and Y values separately (without regard to each other) and compute Pearson’s r. This has an advantage over rank-based approaches in that the scales of measurement remain unchanged. For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means), a Winsorized correlation would be preferable to one based on trimming, though trimming is preferable for group comparisons.
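A sketch of the Winsorized correlation under the assumption that a fixed proportion of scores (20% here, rounded down) is pulled in from each tail. The `winsorize` helper and the Y values are illustrative; the X values reproduce the slide's example.

```python
from math import sqrt

def winsorize(values, prop=0.2):
    """Pull the g lowest and g highest scores in to the nearest retained value."""
    n = len(values)
    g = int(prop * n)                 # number of scores changed in each tail
    s = sorted(values)
    low, high = s[g], s[n - g - 1]
    return [min(max(v, low), high) for v in values]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def winsorized_r(x, y, prop=0.2):
    # X and Y are Winsorized separately, without regard to each other
    return pearson_r(winsorize(x, prop), winsorize(y, prop))

x = [1, 2, 3, 4, 5, 6]            # from the slide
y = [1, 2, 3, 4, 5, -20]          # hypothetical, with one wild Y score

x_w = winsorize(x)                # 2, 2, 3, 4, 5, 5
r_raw = pearson_r(x, y)
r_w = winsorized_r(x, y)
print(x_w, r_raw, r_w)
```

In this toy case the wild Y score flips the raw r negative, while the Winsorized version is positive but still modest, since the odd pairing of a high X with a low Y remains after each variable is Winsorized on its own.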

17 Methods Related to M-estimators: The percentage bend correlation (r_pb) utilizes the median and a generalization of MAD. A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data, and r_pb gets around that. While the details can get a bit technical, you can get some sense of what is going on by relying on what you know regarding the robust approach in general. With independent X and Y variables, the values of robust approaches to correlation will match the Pearson r. With non-normal data, the robust approaches described guard against outliers on the respective X and Y variables, while Pearson’s r does not.
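For a sense of the mechanics, here is a rough sketch of the percentage bend correlation as it is commonly formulated: each variable is centered on a robust location, scaled by a generalized MAD, and values beyond ±1 are "bent" back to ±1 before a correlation-like ratio is taken. The bending constant beta = 0.2 and all helper names are assumptions of this sketch, not something the slides specify.

```python
from math import floor, sqrt
from statistics import median

def _omega(values, beta):
    """Generalized MAD: roughly the (1 - beta) quantile of |x - median|."""
    med = median(values)
    dev = sorted(abs(v - med) for v in values)
    idx = max(floor((1 - beta) * len(values)) - 1, 0)
    return dev[idx]  # assumed to be non-zero (fails with heavy ties)

def _pb_location(values, beta):
    """Percentage bend measure of location."""
    med = median(values)
    omega = _omega(values, beta)
    psi = [(v - med) / omega for v in values]
    i1 = sum(1 for p in psi if p < -1)   # count of low extremes
    i2 = sum(1 for p in psi if p > 1)    # count of high extremes
    middle = sum(v for v, p in zip(values, psi) if -1 <= p <= 1)
    return (middle + omega * (i2 - i1)) / (len(values) - i1 - i2)

def pb_correlation(x, y, beta=0.2):
    def bent(values):
        loc = _pb_location(values, beta)
        omega = _omega(values, beta)
        # Standardize, then clip anything beyond +/-1
        return [max(-1.0, min(1.0, (v - loc) / omega)) for v in values]
    a, b = bent(x), bent(y)
    num = sum(p * q for p, q in zip(a, b))
    return num / sqrt(sum(p * p for p in a) * sum(q * q for q in b))

# Illustrative data: a perfect line, then the same line with one wild Y value
x = list(range(1, 11))
y_line = [2 * v + 1 for v in x]
y_wild = y_line[:-1] + [-100]

r_pb_line = pb_correlation(x, y_line)
r_pb_wild = pb_correlation(x, y_wild)
print(r_pb_line, r_pb_wild)
```

How much gets bent depends on the spread of the data through the generalized MAD, which is the sense in which the amount of "Winsorizing" is determined by the data rather than fixed in advance.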

18 Problem: While these alternative methods help us in some sense, an issue remains. When dealing with correlation, we are not considering the variables in isolation. An outlier on one variable or the other might not be a bivariate outlier. Conversely, a bivariate outlier may not contain values that are outliers for X or Y themselves.

19 Global measures of association: Measures are available that take into account the bivariate nature of the situation: the Minimum Volume Ellipsoid estimator (MVE) and the Minimum Covariance Determinant estimator (MCD).

20 Minimum Volume Ellipsoid Estimator: Robust elliptic plot (relplot). Relplots are like bivariate boxplots for our data, where the inner ellipse contains half the values and anything outside the dotted ellipse would be considered an outlier. A strategy for robust estimation of correlation would be to find the ellipse with the smallest area that contains half the data points. Those points are then used to calculate the correlation. This is the MVE.

21 Minimum Covariance Determinant Estimator: The MCD is another alternative we might use, and it involves the notion of a generalized variance, which is a measure of the overall variability among a cloud of points. For the more adventurous, see my /6810 page for info on matrices and their determinants. The determinant of the covariance matrix is the generalized variance. For the two-variable situation it is s_x² s_y² (1 − r²). As we can see, since r is a measure of linear association, the more tightly the points are packed, the larger r would be, and subsequently the smaller the generalized variance would be. The MCD picks that half of the data which produces the smallest generalized variance, and calculates r from that.
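A toy sketch of the idea, not a production MCD: it checks that the determinant of the 2×2 covariance matrix equals s_x² s_y² (1 − r²), then brute-forces every half-sized subset to find the one with the smallest generalized variance. Real implementations use fast approximate searches plus correction steps; the data and the exhaustive search here are assumptions made for illustration and only work for very small samples.

```python
from itertools import combinations
from math import sqrt

def cov_terms(points):
    """Return (var_x, var_y, cov_xy) for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var_x = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - my) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return var_x, var_y, cov_xy

def generalized_variance(points):
    """Determinant of the 2x2 covariance matrix."""
    var_x, var_y, cov_xy = cov_terms(points)
    return var_x * var_y - cov_xy ** 2

def pearson_r(points):
    var_x, var_y, cov_xy = cov_terms(points)
    return cov_xy / sqrt(var_x * var_y)

def mcd_correlation(points):
    """Brute force: r from the half of the data with the smallest determinant."""
    h = len(points) // 2
    best = min(combinations(points, h), key=generalized_variance)
    return pearson_r(best)

# Illustrative data: six points near y = x plus two stray points
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 6.0),
        (7, 1.0), (8, 0.5)]

var_x, var_y, _ = cov_terms(data)
r_all = pearson_r(data)
gv = generalized_variance(data)
gv_from_r = var_x * var_y * (1 - r_all ** 2)   # same quantity via r
r_mcd = mcd_correlation(data)
print(r_all, r_mcd, gv, gv_from_r)
```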

22 Global measures of association: Note that both the MVE and MCD can be extended to situations with more than two variables; we’d just be dealing with a larger matrix. Example using the Robust library in S-Plus. OMG! Drop down menus even!

23 Remaining issues: Curvature. The fact is that straight lines may not capture the true story. We may often fail to find noticeable relationships because our r, whichever “Pearsonesque” method we choose, is trying to specify a linear relationship. There may still be a relationship, and a strong one, just a more complex one.
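A quick illustration with constructed data: below, y is completely determined by x (y = x²), yet Pearson's r is zero because the relationship is symmetric rather than linear.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]        # perfect, but curved, relationship

r_curve = pearson_r(x, y)
print(r_curve)
```

A scatterplot would reveal the U shape immediately, which is why plotting comes before computing r.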

24 Summary: Correlation, in terms of Pearson r, gives us a sense of the strength of a linear association between two variables. One data point can render it a useless measure, as it is not robust to outliers. Measures which are robust are available, and some take into account the bivariate nature of the data. However, curvilinear relationships may exist, and we should examine the data to see if alternative explanations are viable.

