Presentation transcript: "Correlation"

1 Correlation Assume you have two measurements, x and y, on a set of objects, and would like to know whether x and y are related. If they are directly related, then when x is high, y tends to be high, and when x is low, y tends to be low. If they are inversely related, then when x is high, y tends to be low, and when x is low, y tends to be high. This suggests summing the products of the z-scores.

2 Scatter plot of y vs. x

case    x      y
1       7.7    64.2
2       10.9   65.7
3       10.3   64.3
4       8.7    64.3
5       9.6    64.6
6       10.3   65.8
7       13.6   67.9
8       12.8   67.5
9       8.7    63.5
10      13.0   66.6
11      10.7   64.1
12      9.9    63.9
13      10.7   65.4
14      9.8    66.4
15      9.9    64.2
16      11.5   66.1
17      11.4   65.5
18      11.4   66.8
19      10.7   64.2
20      8.8    64.4

3 Correlation coefficient (Pearson's) r

case    x      y      zx     zy     zx·zy
1       7.7    64.2   -1.9   -0.8    1.55
2       10.9   65.7    0.3    0.3    0.08
3       10.3   64.3   -0.1   -0.8    0.11
4       8.7    64.3   -1.2   -0.8    0.91
5       9.6    64.6   -0.6   -0.5    0.32
6       10.3   65.8   -0.1    0.4   -0.06
7       13.6   67.9    2.0    2.0    4.17
8       12.8   67.5    1.5    1.7    2.62
9       8.7    63.5   -1.2   -1.4    1.66
10      13.0   66.6    1.6    1.0    1.70
11      10.7   64.1    0.1   -0.9   -0.11
12      9.9    63.9   -0.4   -1.1    0.44
13      10.7   65.4    0.1    0.1    0.01
14      9.8    66.4   -0.5    0.9   -0.42
15      9.9    64.2   -0.4   -0.8    0.34
16      11.5   66.1    0.6    0.6    0.42
17      11.4   65.5    0.6    0.2    0.10
18      11.4   66.8    0.6    1.2    0.69
19      10.7   64.2    0.1   -0.8   -0.10
20      8.8    64.4   -1.1   -0.7    0.77
mean    10.5   65.3    0.0    0.0    0.8

r = Σ(zx·zy) / (n − 1) ≈ 0.80
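The computation on this slide can be sketched in a few lines of Python. This is a minimal sketch (not from the original deck), using the sample standard deviation, which matches the rounded z-scores shown:

```python
from math import sqrt

# The 20 (x, y) cases from the scatter-plot slide.
x = [7.7, 10.9, 10.3, 8.7, 9.6, 10.3, 13.6, 12.8, 8.7, 13.0,
     10.7, 9.9, 10.7, 9.8, 9.9, 11.5, 11.4, 11.4, 10.7, 8.8]
y = [64.2, 65.7, 64.3, 64.3, 64.6, 65.8, 67.9, 67.5, 63.5, 66.6,
     64.1, 63.9, 65.4, 66.4, 64.2, 66.1, 65.5, 66.8, 64.2, 64.4]

def zscores(v):
    """Standardize a list of values using the sample standard deviation."""
    n = len(v)
    mean = sum(v) / n
    sd = sqrt(sum((vi - mean) ** 2 for vi in v) / (n - 1))
    return [(vi - mean) / sd for vi in v]

zx, zy = zscores(x), zscores(y)
# Pearson's r: sum of products of z-scores, divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
print(round(r, 2))  # about 0.80
```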

4 Properties of r
- r ranges from -1.0 to +1.0
- r = 1 means a perfect linear relationship
- r = -1 means a perfect linear relationship with negative slope
- r = 0 means no correlation

5 Example scatterplots [panels showing r = 0.80, 0.35, -.24, .41, 0, -.66, and .94]

6 Correlation and causation "Correlation does not imply causation." More precisely, x correlated with y does not imply that x causes y, because:
- the correlation could be a Type I error
- y could cause x
- z could cause both x and y

7 Uncorrelated does not mean independent [Scatterplot of y vs. x showing a strong nonlinear relationship] x is highly predictive of y, but r = 0

8 Significance test for r The aim is to test the null hypothesis that the population correlation ρ (rho) is 0. The larger n is, the less likely a given r is to occur under the null hypothesis. From r and n, we can compute a p-value; from n and α, we can compute a critical r.
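One standard way to carry out this test (an assumption here; the slide does not name the method) is to convert r to a t statistic with n − 2 degrees of freedom. A minimal sketch, using hypothetical numbers r = 0.80 and n = 20:

```python
from math import sqrt

def t_statistic(r, n):
    """t statistic for testing rho = 0; has n - 2 df under the null."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

t = t_statistic(0.80, 20)  # about 5.66
# The two-tailed critical t for alpha = .05 at df = 18 is about 2.101,
# so this t is far into the rejection region: reject rho = 0.
print(t > 2.101)  # True
```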

9 Regression Correlation suggests a linear model of y as a function of x. A linear model is defined by y = mx + b + e, where mx + b is the equation for a line (m = slope, b = intercept), e is a random error with mean 0, and ŷ = mx + b is the predicted y.

10 [Scatterplots of the regression of y on x and of the residuals e vs. x] R² = 0.5336, F = 20.5949, p = 0.0003. Regression line: y = -1.18x + 18.77
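The slide's underlying data are not given, but the least-squares fit itself is easy to sketch. The toy points below are hypothetical, chosen so the slope comes out near the slide's -1.18:

```python
def fit_line(x, y):
    """Ordinary least squares: m = cov(x, y) / var(x), b = ybar - m * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    m = sxy / sxx
    b = ybar - m * xbar
    return m, b

# Hypothetical points lying near a downward-sloping line.
m, b = fit_line([1, 2, 3, 4], [17.8, 16.5, 15.5, 14.2])
print(round(m, 2), round(b, 2))  # -1.18 18.95
```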

11 r vs. R² R² is actually the square of r. So why is it capitalized and squared in a regression? r ranges from -1 to 1, but in a regression r cannot meaningfully be negative, because it is the correlation between y and ŷ; since ŷ is the best estimate of y, this correlation is automatically nonnegative. The capitalization reflects this situation, and squaring gives a quantity interpretable as a proportion of variance.

12 Interpretation of R² R² can be interpreted as the proportion of the variance accounted for:

R² = 1 − SS_error / SS_total = SS_regression / SS_total

where SS_error sums squared deviations from the regression line and SS_total sums squared deviations from the mean. R² is high when the unexplained (residual) variance is small relative to the total amount of variance.
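The sum-of-squares definition can be checked with a small sketch (the data and line here are hypothetical, chosen to make the two extreme cases exact):

```python
def r_squared(x, y, m, b):
    """R^2 = 1 - SS_error / SS_total for the line yhat = m*x + b."""
    ybar = sum(y) / len(y)
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    ss_error = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_error / ss_total

# A line through every point explains all the variance...
print(r_squared([0, 1, 2], [1, 3, 5], 2, 1))  # 1.0
# ...while a flat line at the mean explains none of it.
print(r_squared([0, 1, 2], [1, 3, 5], 0, 3))  # 0.0
```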

13 Simpson's paradox [Scatterplot: length of ears vs. size of animal, with clusters for rabbits, humans, and whales. Negatively correlated? Or positively correlated?] Adding a variable can change the sign of the correlation.

14 Effect size Beyond computing significance, we often need an estimate of the magnitude of an effect. There are two basic ways of expressing this: - Normalized mean difference - Proportion of variance accounted for

15 The normalized difference between means Cohen's d expresses the difference between two means relative to the spread in the data.
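A minimal sketch of Cohen's d for two independent groups, using the pooled standard deviation (the groups below are hypothetical, chosen so d comes out to exactly 1.0):

```python
from math import sqrt

def cohens_d(a, b):
    """Cohen's d: difference between group means divided by the pooled sd."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_sd = sqrt((ssa + ssb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Means differ by 2 and the pooled sd is 2, so the means are one sd apart.
print(cohens_d([3, 5, 7], [1, 3, 5]))  # 1.0
```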

16 Proportion of variance accounted for R 2 can be interpreted as the proportion of all the variance in the data that is predicted by a regression model η 2 (eta squared) can be interpreted as the proportion of all variance in a factorial design that is accounted for by main effects and interactions

17 Power Power is the probability of finding an effect of a given size in your experiment, i.e., the probability of rejecting the null hypothesis when the null hypothesis is actually false.
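As an illustration (an assumed setting, not from the deck), power for a one-sided one-sample z-test can be computed from the normal CDF; it grows with both the effect size d and the sample size n:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power_one_sided_z(d, n):
    """Approximate power of a one-sided one-sample z-test at alpha = .05
    for a true effect of d standard deviations with n observations."""
    z_crit = 1.645  # one-sided .05 cutoff
    return 1 - phi(z_crit - d * sqrt(n))

# A medium effect (d = 0.5) is found only about half the time with n = 10,
# but almost certainly with very large n.
print(round(power_one_sided_z(0.5, 10), 2))
print(round(power_one_sided_z(0.5, 1000), 2))
```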


19 Outliers An outlier is a measurement so discrepant from the others that it seems "suspicious." If p(x_suspicious | distribution) is low enough, we "reject the null hypothesis" that x_suspicious came from the same distribution as the others, and remove it. A common rule of thumb is |z| > 2.5 (or 2, or 3). But also consider transforms that avoid outliers in the first place, such as 1/x. Removed data is best NOT REPLACED; if it must be replaced, do so "conservatively," i.e., in a manner biased toward the null hypothesis.
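The z-score rule of thumb can be sketched directly (a minimal sketch with made-up data; the 2.5 cutoff is the slide's rule of thumb, not a fixed standard):

```python
from math import sqrt

def flag_outliers(data, cutoff=2.5):
    """Return values whose |z-score| exceeds the cutoff."""
    n = len(data)
    mean = sum(data) / n
    sd = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return [x for x in data if abs(x - mean) / sd > cutoff]

# Nine values near 10 and one wild measurement.
print(flag_outliers([10, 11, 9, 10, 12, 10, 11, 9, 10, 50]))  # [50]
```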

20 Chi squared Assume that K mutually exclusive outcomes are predicted to occur E1, E2, ..., EK times, but are actually observed to occur N1, N2, ..., NK times respectively. A chi-squared test allows us to evaluate the null hypothesis that the proportions were as expected, with deviations occurring "by chance."

21 Performing a chi-squared test For each outcome, compute (Nk − Ek)² / Ek. Sum these over all K outcomes. Under the null hypothesis, this total will be distributed as a χ² distribution with K − 1 degrees of freedom.
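The statistic itself is a one-liner. A minimal sketch with hypothetical counts (three equally likely outcomes observed 60 times):

```python
def chi_squared(observed, expected):
    """Sum of (N_k - E_k)^2 / E_k over the K outcomes."""
    return sum((n - e) ** 2 / e for n, e in zip(observed, expected))

stat = chi_squared([10, 20, 30], [20, 20, 20])
# stat = 10.0; the .05 critical value of chi-squared with K - 1 = 2 df
# is about 5.99, so we would reject "all outcomes equally likely."
print(stat)  # 10.0
```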

22 The Bayesian perspective Conventional statistics is based on a frequentist definition of probability, which insists that hypotheses do not have “probabilities.” → All we can do is “reject” H, or not reject it. Bayesian inference is based on a subjectivist definition of probability, which considers p(H) to be the “degree of belief” in hypothesis H, simply expressing our uncertainty about H in light of the data. → Instead of accepting or rejecting, we seek p(H|E).

23 Cartoon 1: Fisher Fisher: Given the sampling distribution of the null, p(E|H0), consider the likelihood of the null hypothesis, integrated out to the tail. If this probability is low, this tends to contradict the null hypothesis. In fact, if it is lower than .05, we informally "reject" the null. [Density plot of p(E|H0) over E, with the tail beyond the observed E shaded.]

24 Cartoon 2: Neyman & Pearson N&P: There are really two hypotheses, the null H0 and some alternative H1. Our main goal is to avoid a Type I error, so we set its probability at α, which determines our criterion for rejecting the null. Note, though, that the other outcomes are also possible: a Type II error, a hit, or a correct rejection. Compute power and set the sample size to control the probability of a Type II error. [Density plots of p(E|H0) and p(E|H1), the latter centered at the expected effect size μ1.]

25 Cartoon 3: Bayes (/Laplace/Jeffreys) What we really want is to evaluate how strongly our data favor either hypothesis, not just make an accept/reject decision. For each H, the degree of belief in it, conditioned on the data, is p(H|E). So to evaluate the relative strength of H1 and H0, consider the posterior ratio:

p(H1|E) / p(H0|E) = degree of belief in H1 / degree of belief in H0

This expresses how strongly the data and priors favor H1 relative to H0, taking into account everything we know about the situation.

26 Decomposing the posterior ratio posterior ratio = prior ratio × likelihood ratio:

p(H1|E) / p(H0|E) = [p(H1) / p(H0)] × [p(E|H1) / p(E|H0)]

If you want to be "unbiased," set the prior ratio to 1, sometimes called an "uninformative prior." Then your posterior belief about H0 and H1 depends entirely on the likelihood ratio, aka the "Bayes factor."

27 Visualizing the likelihood ratio [Density plots of p(E|H0), centered at 0, and p(E|H1), centered at the expected effect size μ1. At the observed E, the likelihood ratio p(E|H1) / p(E|H0) is the ratio of the heights of the two curves at that point.]
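The picture can be sketched numerically. This minimal sketch assumes two Gaussian hypotheses with equal spread (H0 centered at 0, H1 at μ1), which is one common way to draw this cartoon, not something the slide specifies:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height of the Gaussian density at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def likelihood_ratio(e, mu1, sigma):
    """p(E|H1) / p(E|H0): ratio of the two curve heights at the observed e."""
    return normal_pdf(e, mu1, sigma) / normal_pdf(e, 0.0, sigma)

# An observation halfway between the two hypotheses is neutral evidence.
print(likelihood_ratio(1.0, 2.0, 1.0))  # 1.0
# An observation at mu1 favors H1; one at 0 favors H0.
print(likelihood_ratio(2.0, 2.0, 1.0) > 1)  # True
```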

28 Interpretation of likelihood ratios LR = 1 means the evidence was neutral about which hypothesis is correct. LR > 1 means the evidence favors H1. Jeffreys (1939) suggested rules of thumb, e.g. LR > 3 means "substantial" evidence in favor of H1, LR > 10 means "strong" evidence, etc. LR < 1 means the evidence actually favored the null hypothesis.

29 LRs vs. p-values Likelihood ratios and p-values are not at all the same thing, but in practice they are related (Dixon, 1998).

