
Correlation Review and Extension

Questions to be asked…
- Is there a linear relationship between x and y?
- What is the strength of this relationship? The Pearson product-moment correlation coefficient (r)
- Can we describe this relationship and use it to predict y from x? y = bx + a
- Is the relationship we have described statistically significant? Though a test against a null of r = 0 is not a very interesting one
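The questions above can be sketched in a few lines. This is an illustration with made-up data (the values are not from the slides), assuming numpy and scipy are available: scipy gives r and the test against the null of r = 0, and a degree-1 polynomial fit gives the b and a of y = bx + a.

```python
import numpy as np
from scipy import stats

# Illustrative data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Strength of the linear relationship, plus a test against the null r = 0
r, p = stats.pearsonr(x, y)

# Describe the relationship: slope b and intercept a of y = bx + a
b, a = np.polyfit(x, y, 1)
print(r, p, b, a)
```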

Other stuff
- Check scatterplots to see whether a Pearson r makes sense
- Use both r and r^2 to understand the situation
- If the data are non-metric or non-normal, use "non-parametric" correlations
- Correlation does not prove causation: the true relationship may run in the opposite direction, be co-causal, or be due to other variables
- However, correlation is the primary statistic used in making an assessment of "potential" causation

Possible outcomes: -1 to +1
- As one variable increases/decreases, the other variable increases/decreases: positive covariance
- As one variable increases/decreases, the other decreases/increases: negative covariance
- No relationship (independence): r = 0
- Non-linear relationship

Covariance
- The variance shared by two variables
- When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative), cov(x,y) is positive
- When X and Y move in opposite directions, cov(x,y) is negative
- When there is no constant relationship, cov(x,y) = 0

Covariance is not very meaningful on its own and cannot be compared across different scales of measurement. Solution: standardize this measure. Pearson's r:

r = cov(x,y) / (s_x s_y)
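As a quick check of the standardization step, here is a numpy sketch (illustrative values, assuming numpy is available) showing that the covariance divided by the two standard deviations reproduces Pearson's r:

```python
import numpy as np

# Illustrative data
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

cov_xy = np.cov(x, y, ddof=1)[0, 1]                    # the shared variance
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))   # standardized -> Pearson's r
print(cov_xy, r)
```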

Factors affecting Pearson r
- Linearity
- Heterogeneous subsamples
- Range restrictions
- Outliers

Linearity
Nonlinear relationships will have an adverse effect on a measure designed to detect a linear relationship

Heterogeneous subsamples
Sub-samples may artificially increase or decrease the overall r. Solution: calculate r separately for each sub-sample as well as overall, and look for differences
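A small constructed example (hypothetical data) of what this slide warns about: two subsamples, each with exactly zero within-group correlation, can still yield a large pooled r when the groups differ in their means on both variables.

```python
import numpy as np

# Two subsamples; within each, x and y are unrelated (r = 0 exactly),
# but the groups differ in their means on both variables
g1 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
g2 = g1 + 5.0

r1 = np.corrcoef(g1[:, 0], g1[:, 1])[0, 1]   # within-group r: 0
r2 = np.corrcoef(g2[:, 0], g2[:, 1])[0, 1]   # within-group r: 0

pooled = np.vstack([g1, g2])
r_pooled = np.corrcoef(pooled[:, 0], pooled[:, 1])[0, 1]  # large pooled r
print(r1, r2, r_pooled)
```

Computing r separately per subgroup, as the slide suggests, immediately exposes the discrepancy.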

Range restriction
- Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r
- A common example occurs with Likert scales
- However, it is also the case that restricting the range can actually increase r if, by doing so, highly influential data points are kept out
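The attenuation effect is easy to simulate. A hedged sketch with simulated data (the numbers are arbitrary, assuming numpy is available): the same noisy linear relationship yields a much smaller r once x is restricted to a narrow band.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1000.0)
y = x + rng.normal(0, 150, size=1000)   # noisy linear relationship

r_full = np.corrcoef(x, y)[0, 1]

# Restrict x to a narrow band, as a bounded response scale might
mask = (x >= 400) & (x < 600)
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]
print(r_full, r_restricted)   # the restricted r is attenuated
```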

Effect of outliers
- Outliers can artificially and dramatically increase or decrease r
- Options:
  - Compute r with and without the outliers
  - Conduct a robustified correlation; for example, recode outliers as having more conservative scores (winsorize)
  - Transform the variables
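The first option, computing r with and without the outliers, takes two lines. A constructed illustration (hypothetical data): twenty points on a perfect line, with a single corrupted value, are enough to flip the sign of r.

```python
import numpy as np

x = np.arange(20.0)
y = x.copy()
y[-1] = -100.0                                  # a single outlier

r_with = np.corrcoef(x, y)[0, 1]                # outlier included
r_without = np.corrcoef(x[:-1], y[:-1])[0, 1]   # outlier dropped
print(r_with, r_without)   # one point turns a perfect correlation negative
```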

What else?
- r is the starting point for any regression and related methods
- Both the slope and the magnitude of the residuals are reflective of r; r = 0 implies slope = 0
- As such, a lone r doesn't really provide much more than a starting point for understanding the relationship between two variables
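The link between r and the slope is the identity b = r(s_y / s_x), which a quick numpy check confirms (illustrative data; numpy assumed available):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 4.0, 4.5, 6.0])

r = np.corrcoef(x, y)[0, 1]
b = r * np.std(y, ddof=1) / np.std(x, ddof=1)   # slope recovered from r
b_fit = np.polyfit(x, y, 1)[0]                  # slope from least squares
print(b, b_fit)                                 # the two agree
```

This also makes the r = 0 ⇒ slope = 0 point on the slide immediate: a zero r forces b to zero.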

Robust approaches to correlation
- Rank approaches
- Winsorized
- Percentage bend

Rank approaches: Spearman's rho and Kendall's tau
- Spearman's rho is calculated using the same formula as Pearson's r, but with the variables in the form of ranks
- Simply replace each value of X and each value of Y with its rank, then calculate r as normal
- Kendall's tau is another rank-based approach, but the details of its calculation are different
- For theoretical reasons it may be preferable to Spearman's, but the two should be consistent for the most part, and both perform better than Pearson's r when dealing with non-normal data
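Both rank statistics are available in scipy, and the "Pearson's r on ranks" description of Spearman's rho can be verified directly. Illustrative data with one extreme value (numpy/scipy assumed available):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one extreme value
y = np.array([2.0, 4.0, 5.0, 8.0, 10.0])

rho, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)

# Spearman's rho is just Pearson's r computed on the ranks
rho_by_hand = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
print(rho, tau, rho_by_hand)
```

Because the relationship here is perfectly monotone, both rank measures give 1 regardless of how extreme the last x value is; Pearson's r would not.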

Winsorized correlation
- As mentioned before, Winsorizing data involves changing some decided-upon percentage of extreme scores (high and low) to the value of the most extreme score that is not Winsorized
- Winsorize both the X and Y values (without regard to each other) and compute Pearson's r
- This has an advantage over rank-based approaches in that the nature of the scale of measurement remains unchanged
- For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means), a Winsorized correlation would be preferable to trimming, though trimming is preferable for group comparisons
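A minimal sketch of this recipe using scipy's winsorize (the helper name `winsorized_r` and the 20% limit are illustrative choices, not from the slides): each variable is Winsorized separately, then an ordinary Pearson's r is computed on the results.

```python
import numpy as np
from scipy.stats import mstats

def winsorized_r(x, y, pct=0.20):
    """Winsorize each variable separately, then compute Pearson's r."""
    xw = np.asarray(mstats.winsorize(x, limits=(pct, pct)))
    yw = np.asarray(mstats.winsorize(y, limits=(pct, pct)))
    return np.corrcoef(xw, yw)[0, 1]

# Illustrative data: a line with one wild value in the middle
x = np.arange(15.0)
y = x.copy()
y[7] = 80.0

r_raw = np.corrcoef(x, y)[0, 1]
r_wins = winsorized_r(x, y)
print(r_raw, r_wins)   # the Winsorized r recovers the strong linear trend
```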

Methods related to M-estimators
- The percentage bend correlation utilizes the median and a generalization of MAD
- A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data; the percentage bend correlation (r_pb) gets around that
- While the details can get a bit technical, you can get some sense of what is going on by relying on what you know about the robust approach in general
- With independent X and Y variables, the values of these robust correlations will match the Pearson r
- With non-normal data, the robust approaches described guard against outliers on the respective X and Y variables, while Pearson's r does not
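To give a flavor of the idea, here is a deliberately simplified sketch (not Wilcox's exact r_pb estimator; in particular, the clipping threshold here is a plain quantile of the absolute deviations rather than the proper percentage-bend scale, and the function name is made up): center each variable at its median, scale by a data-driven MAD-like quantity, "bend" (clip) the standardized scores at ±1, and correlate the bent scores.

```python
import numpy as np

def percentage_bend_r(x, y, beta=0.2):
    # Simplified sketch: center at the median, scale by the (1 - beta)
    # quantile of the absolute deviations (a MAD-like measure),
    # clip the standardized scores at +/-1, then correlate them.
    def bent(v):
        v = np.asarray(v, dtype=float)
        med = np.median(v)
        omega = np.quantile(np.abs(v - med), 1 - beta)   # data-driven scale
        return np.clip((v - med) / omega, -1.0, 1.0)
    a, b = bent(x), bent(y)
    return np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b))

x = np.arange(11.0)
print(percentage_bend_r(x, 2 * x + 1))   # a clean linear relation still gives 1
```

The `beta` parameter plays the role the slide describes: rather than a fixed Winsorizing percentage, the scale (and hence which scores get bent) is determined from the data.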

Problem
- While these alternative methods help us in some sense, an issue remains
- When dealing with correlation, we are not considering the variables in isolation
- Outliers on one or the other variable might not be bivariate outliers
- Conversely, a bivariate outlier may not contain values that are outliers for X or Y themselves

Global measures of association
Measures are available that take into account the bivariate nature of the situation:
- Minimum Volume Ellipsoid Estimator (MVE)
- Minimum Covariance Determinant Estimator (MCD)

Minimum Volume Ellipsoid Estimator
- Robust elliptic plot (relplot): relplots are like boxplots for a scatterplot, where the inner ellipse contains half the values and anything outside the dotted ellipse would be considered an outlier
- A strategy for robust estimation of correlation is to find the ellipse with the smallest area that contains half the data points
- Those points are then used to calculate the correlation: the MVE

Minimum Covariance Determinant Estimator
- The MCD is another alternative we might use, and involves the notion of a generalized variance: a measure of the overall variability among a cloud of points
- For the more adventurous, see my /6810 page for info on matrices and their determinants
- The determinant of the covariance matrix is the generalized variance; for the two-variable situation, det(S) = s_x^2 s_y^2 (1 - r^2)
- As r is a measure of linear association, the more tightly the points are packed, the larger r would be, and subsequently the smaller the generalized variance would be
- The MCD picks the half of the data which produces the smallest generalized variance, and calculates r from that
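Both ideas on this slide can be checked numerically. The sketch below (illustrative data; numpy assumed available) first verifies the determinant identity for the two-variable case, then runs a deliberately naive MCD: a brute-force search over all half-samples for the one with the smallest generalized variance. Real MCD implementations use a much smarter search (e.g. the FAST-MCD algorithm); exhaustive enumeration is only feasible for tiny n like this.

```python
import numpy as np
from itertools import combinations

# Illustrative data: nine points near a line plus one bivariate outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.9, 2.0])

# Generalized variance of a two-variable cloud: det(S) = sx^2 * sy^2 * (1 - r^2)
S = np.cov(x, y, ddof=1)
r_all = np.corrcoef(x, y)[0, 1]
gen_var = np.linalg.det(S)

# Naive MCD sketch: brute-force over all half-samples, keep the one with
# the smallest generalized variance, and compute r from those points
n, h = len(x), len(x) // 2
best = min(combinations(range(n), h),
           key=lambda idx: np.linalg.det(np.cov(x[list(idx)], y[list(idx)], ddof=1)))
r_mcd = np.corrcoef(x[list(best)], y[list(best)])[0, 1]
print(r_all, r_mcd)   # the MCD-style r recovers the strong linear trend
```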

Global measures of association
- Note that both the MVE and MCD can be extended to situations with more than two variables; we'd just be dealing with a larger matrix
- Example using the Robust library in S-Plus. OMG! Drop-down menus even!

Remaining issues: curvature
- The fact is that straight lines may not capture the true story
- We may often fail to find noticeable relationships because our r, whichever "Pearsonesque" method we choose, is trying to specify a linear relationship
- There may still be a relationship, and a strong one, just a more complex one
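The classic demonstration makes this concrete: a perfectly deterministic but curved relationship can produce an r of essentially zero (illustrative construction; numpy/scipy assumed available).

```python
import numpy as np
from scipy import stats

x = np.arange(-5.0, 6.0)   # symmetric around zero
y = x ** 2                 # perfectly deterministic, but curved

r, _ = stats.pearsonr(x, y)
print(r)   # essentially zero: r misses a perfect (nonlinear) relationship
```

As the summary below argues, this is why examining the scatterplot, not just r, is essential.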

Summary
- Correlation, in terms of the Pearson r, gives us a sense of the strength of a linear association between two variables
- One data point can render it a useless measure, as it is not robust to outliers
- Robust measures are available, and some take into account the bivariate nature of the data
- However, curvilinear relationships may exist, and we should examine the data to see whether alternative explanations are viable