# Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 9 – Correlation and linear regression Marshall University.

Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 9 – Correlation and linear regression Marshall University Genomics Core Facility

## Correlation
- Correlation describes the tendency of one variable to vary in the same (or opposite) direction as another variable
- Example (from Motulsky): Borkman et al. measured insulin sensitivity and the percentage of polyunsaturated fatty acids with 20 to 22 carbon atoms (%C20-22) in 13 healthy men
- Both variables show a degree of variation

## Scatterplot
[Figure: scatterplot of insulin sensitivity against %C20-22]
- The plot seems to show a relationship, or correlation, between the variables
- The higher the %C20-22, the higher the insulin sensitivity

## Correlation Coefficient
The correlation coefficient between two sets of values $x_1, \dots, x_n$ and $y_1, \dots, y_n$ is computed as follows:
- Calculate the standardized values of x and y, using the sample standard deviation (the one with n-1 in the denominator): $z_{x,i} = (x_i - \mathrm{mean}(x)) / \mathrm{sd}(x)$ and $z_{y,i} = (y_i - \mathrm{mean}(y)) / \mathrm{sd}(y)$
- Compute the products of all the standardized values, add them up, and divide by n-1: $r = (z_{x,1} z_{y,1} + z_{x,2} z_{y,2} + \dots + z_{x,n} z_{y,n}) / (n-1)$
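
The two steps above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `pearson_r` and the sample values are made up for the example and are not the insulin sensitivity data.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Correlation coefficient: sum of products of standardized values, divided by n-1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample standard deviations (n-1 denominator)
    zx = [(xi - mx) / sx for xi in x]  # standardized x values
    zy = [(yi - my) / sy for yi in y]  # standardized y values
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Hypothetical values, chosen only to show the call
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
print(round(pearson_r(x, y), 3))
```

Statistical packages compute the same quantity directly; writing it out this way just makes the role of the standardized scores visible.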

## Why the correlation coefficient works
- If a value is bigger than the mean, its standardized score is positive; if it is smaller than the mean, its standardized score is negative
- The product of two standardized scores is positive if both scores are positive or both are negative, i.e. if both values are bigger than their means or both are smaller. It is negative if one value is above its mean and the other is below
- So if one variable tends to increase when the other increases, the bulk of the products of standardized scores will be positive, and the correlation coefficient will be positive (close to 1 for a strong relationship)
- On the other hand, if one variable tends to decrease when the other increases, the bulk of the products of standardized scores will be negative, and the correlation coefficient will be negative (close to -1 for a strong relationship)
- If there is no relationship, the signs of the products will be mixed, the products will tend to cancel out, and the correlation coefficient will be close to zero

## Correlation coefficient for the insulin sensitivity data
- The correlation coefficient for the insulin sensitivity data is r = 0.77
- The square of this value is r² = 0.59
- r² is always between 0 and 1
- r² is easier to interpret than r: 59% of the variation in insulin sensitivity can be "explained" by the variation in %C20-22
  - We will make this more precise later

## Confidence Intervals for Correlation Coefficients
- Most statistical software will compute a confidence interval for a correlation coefficient
- The 95% confidence interval for these data is [0.38, 0.93]
- We are 95% confident the interval from 0.38 to 0.93 includes the true correlation coefficient for insulin sensitivity and %C20-22 fatty acid content
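
The lecture does not say which method the software uses. One common choice is the Fisher z-transformation, sketched below as an assumption rather than as the method behind the quoted interval. The function name `correlation_ci` is hypothetical; only `r = 0.77` and `n = 13` come from the slides.

```python
import math
from statistics import NormalDist

def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation via the Fisher z-transformation."""
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    if n < 4:
        raise ValueError("the approximation needs at least 4 observations")
    z = math.atanh(r)                # transform r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)        # approximate standard error on that scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    # Build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

low, high = correlation_ci(0.77, 13)
print(f"{low:.2f} to {high:.2f}")
```

For r = 0.77 and n = 13 this approximation gives about 0.38 to 0.93, in line with the interval quoted above. Note that the interval is not symmetric around r, because r cannot exceed 1.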

## GRHL2 and Epithelial-Mesenchymal Transition
- Epithelial-Mesenchymal Transition (EMT) is a process cancer cells must undergo before metastasis can occur
- Mani et al. (Cell 2008; 133; 704-15) published a gene signature for cells which have undergone EMT
  - Relative expression of a set of 251 genes indicative of EMT
- Cieply et al. (Cancer Research, 2012) attempted to induce EMT in GRHL2-overexpressing cells and profiled the resulting gene expression by microarray
  - Hypothesized that GRHL2 would suppress EMT
  - Compared expression of Core EMT genes in their assay to that of Mani

## Expression of Core EMT genes in GRHL2 overexpressed cells
[Figure: expression of Core EMT genes in GRHL2-overexpressing cells compared with the Mani EMT signature]
- Expression patterns show a strong negative correlation
- Suggests that GRHL2 has suppressed EMT

## p-values for correlation coefficients
- It is possible to compute a p-value for correlation coefficients
- The null hypothesis is that there is no correlation, i.e. that the true correlation coefficient is zero
- So the p-value is the probability of getting a correlation coefficient at least as far from zero as the one observed, from a random sample of the same size as the one used, assuming there is no correlation in the population
- Note that with large samples, p-values for correlation coefficients tend to be very small
  - For the insulin sensitivity example (n=13), p=0.0021
  - For the GRHL2-EMT example (n=216), $p < 10^{-16}$
- It is important to look at the r or r² value to determine if the result is of biological importance
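
As a rough check on the insulin sensitivity p-value, the sketch below uses the standard result that, under the null hypothesis, r can be converted to a t statistic with n-2 degrees of freedom. The function names and the use of Simpson's rule to integrate the t density are choices made for this example so that it runs with the standard library only; statistical software uses more robust routines.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def correlation_p_value(r, n, steps=2000):
    """Two-sided p-value for the null hypothesis that the true correlation is zero."""
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule (steps must be even) for the area under the density from 0 to t
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    # Both tails together: everything outside the interval from -t to t
    return t, 1 - 2 * central_area

t, p = correlation_p_value(0.77, 13)
print(f"t = {t:.2f}, p = {p:.4f}")
```

With r = 0.77 and n = 13 this gives t of about 4.0 and a p-value of about 0.002, consistent with the p = 0.0021 quoted above. The sketch is not suitable for very small p-values or very large samples, where subtracting from 1 loses precision and `math.gamma` can overflow.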

## Correlation and Causality
- A very common error is to assume that correlation implies causality
- In the insulin sensitivity example, it would be wrong to conclude from the correlation alone that high lipid content caused high insulin sensitivity
- The possible reasons for the correlation in this example are:
  - Lipid content determines insulin sensitivity
  - Insulin sensitivity determines lipid content
  - Both lipid content and insulin sensitivity are determined by a common factor
  - There is a complex network of interacting factors of which lipid content and insulin sensitivity are two components
  - It is a coincidence
- The p-value tells you how rare a coincidence would be, under the null hypothesis
- To decide among the other possibilities, further experimentation is needed

## Correlation and Causality in the Examples
- In the first example (insulin sensitivity), the investigators performed further experiments in which they manipulated the variables
  - They concluded that lipid content determined insulin sensitivity (to some extent)
- In the second example, the data come from the same genes under different sets of conditions
  - There is no mechanism for the expression under one condition to affect the expression under another condition
  - They are both determined by a common factor (the extent to which the cell has undergone EMT)
- In the first example, it makes sense to investigate the nature of the influence of lipid content on insulin sensitivity further

## Simple Linear Regression
- Correlation asks the question "To what extent is there a linear relationship between two variables?"
- Linear regression asks the question "What is the linear relationship between two variables?"
- Correlation is symmetric: the correlation coefficient between x and y is the same as the correlation coefficient between y and x
- Linear regression is not symmetric:
  - One variable must be designated as independent and one must be designated as dependent
  - It assumes a model of causality
  - Switching the roles of independent and dependent variables will produce different results
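
The asymmetry can be made concrete with the standard formulas for the two least-squares slopes. The notation here is introduced for this note and is not from the slides: $b_{y \mid x}$ is the slope when y is predicted from x, $b_{x \mid y}$ the slope when x is predicted from y, and $s_x$, $s_y$ are the sample standard deviations.

$$
b_{y \mid x} = r\,\frac{s_y}{s_x}, \qquad b_{x \mid y} = r\,\frac{s_x}{s_y}, \qquad b_{y \mid x}\, b_{x \mid y} = r^2
$$

If the two fits described the same line, the slopes would be reciprocals and their product would be 1. That happens only when $r^2 = 1$, so for any imperfect correlation the two regressions give different lines.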

## What does linear regression do?
- Linear regression calculates the straight line that gives the best prediction of the y values from the x values
- It finds the values of a and b in the equation y = a + bx to do this
- This is done by minimizing the sum of the squares of the vertical distances from each point to the line
- Note that:
  - The roles of x and y are predetermined, and affect the result
  - We can only estimate a and b based on our data sample
  - We cannot know the true population values for a and b
  - It is usually helpful to calculate a confidence interval for these
- Does it make sense to perform linear regression on
  - The insulin sensitivity data?
  - The GRHL2-EMT expression data?
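
Minimizing the sum of squared vertical distances has a closed-form solution, which the sketch below implements. The function name `least_squares_line` and the data are hypothetical and chosen only to illustrate the calculation and the effect of swapping x and y.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (a, b) for the line y = a + b*x that minimizes the squared vertical distances."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx      # slope
    a = my - b * mx    # intercept: the line passes through (mean(x), mean(y))
    return a, b

# Hypothetical values, not the lecture data
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
print(least_squares_line(x, y))  # predicting y from x
print(least_squares_line(y, x))  # predicting x from y: a different fit
```

For these made-up values both calls return an intercept of about 0.5 and a slope of about 0.8, but they describe different lines: rearranging y = 0.5 + 0.8x gives x = -0.625 + 1.25y, not x = 0.5 + 0.8y. This is the sense in which the roles of x and y affect the result.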

## Linear Regression for Insulin Sensitivity

## Linear Regression results for Insulin Sensitivity

## Interpreting linear regression results
- The best fit values show the slope and intercepts of the line, along with their standard errors
- So the estimate of the slope is 37.21 with standard error 9.3
  - For each 1% increase in the percentage of polyunsaturated fatty acids with 20-22 carbon atoms, the insulin sensitivity increases on average by 37.21 mg/m²/min
- The 95% confidence interval for the slope ranges from 16.75 to 57.67
  - This is easier to interpret than the standard error
  - We are 95% confident the range 16.75 to 57.67 includes the true value of the slope
- The intercepts give the value of the insulin sensitivity when the %C20-22 is 0, and the value of %C20-22 that would yield an insulin sensitivity of 0
  - Are these meaningful?
- The R² value is 0.5929. This means that 59% of the variance in insulin sensitivity can be accounted for by the variation in C20-22 polyunsaturated fatty acids, and the remaining 41% is the result of other factors
  - We will discuss R² in more detail later
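
The confidence interval for the slope can be approximately recovered from the estimate and its standard error. The sketch below assumes the usual t-based interval with n-2 = 11 degrees of freedom, for which the two-sided 95% critical value is about 2.201. The symbols $b$, $\mathrm{SE}(b)$ and $t^{*}$ are notation introduced here, not taken from the slides.

$$
b \pm t^{*}_{n-2}\,\mathrm{SE}(b) = 37.21 \pm 2.201 \times 9.3 \approx 37.21 \pm 20.47
$$

This gives roughly 16.74 to 57.68, which agrees with the reported interval of 16.75 to 57.67 up to the rounding of the standard error.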

## p-value for linear regression
- The linear regression results give a p-value of 0.0021
- To interpret this, we need to know the null hypothesis
- The null hypothesis is that there really is no linear relationship between insulin sensitivity and %C20-22
  - If this were true, the best fit line would have a slope of zero
- If the null hypothesis were true (there is no linear relationship between the two), the chance of seeing a best fit line with a slope at least this steep would be 0.21%
- Note that the null hypothesis for correlation is essentially equivalent to the null hypothesis for linear regression
  - Hence the p-values are equal
  - However, the interpretations are different
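
The equality of the two p-values can be checked from the numbers already quoted, assuming the standard t tests for a slope and for a correlation coefficient, each with n-2 = 11 degrees of freedom:

$$
t_{\text{slope}} = \frac{b}{\mathrm{SE}(b)} = \frac{37.21}{9.3} \approx 4.00, \qquad
t_{\text{corr}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{0.77\sqrt{11}}{\sqrt{1-0.5929}} \approx 4.00
$$

The two statistics agree up to rounding, so they lead to the same p-value.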

## Assumptions for linear regression
- Linear regression is based on the following assumptions:
  - There is a linear relationship between the two quantities
  - The residuals are normally distributed
    - The residuals are the vertical distances of each point from the line: the random scatter
  - The variability is the same all the way along the line
  - Data points are independent
  - The x and y values are measured independently
  - The x values are known precisely
- Be careful of the following:
  - Do not try to interpret the linear regression for values far from the data
  - In our example, the %C20-22 values were all between 17 and 25. The linear regression is unlikely to be meaningful for values far from this. In particular, the intercept value (%C20-22 = 0) is likely to be meaningless
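
Several of these assumptions are statements about the residuals, so it can help to compute them explicitly and look at them. The sketch below fits the least-squares line and returns the vertical distance of each point from it. The function name `residuals` and the data are hypothetical.

```python
from statistics import mean

def residuals(x, y):
    """Vertical distances y_i - (a + b*x_i) of each point from the least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx  # slope
    a = my - b * mx                                               # intercept
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical values, not the lecture data
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
for xi, ei in zip(x, residuals(x, y)):
    print(f"x = {xi:4.1f}  residual = {ei:+.2f}")
```

Plotting the residuals against x is a common informal check: a curved pattern suggests the relationship is not linear, and a spread that widens or narrows along x suggests the variability is not constant.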

## Common mistakes with linear regression
- Be careful of the following traps when using linear regression:
  - Not all relationships are linear!
    - If the R² value for linear regression is low, consider the possibility that there may be another relationship between the variables
  - Don't use linear regression on smoothed data
    - This violates the assumption that data points are independent
  - Don't use linear regression if y is (partly) calculated from x
    - For example, if y is the change in a measurement before and after treatment, and x is the value before treatment
    - This violates the assumption that x and y are measured independently
- Always carefully consider which variable is x and which is y
  - If you can't decide, you probably shouldn't be using regression
- Always plot the data

## Summary: Correlation
- Correlation measures the extent to which two variables share a linear relationship
- Makes no assumptions and draws no conclusions about causality
- The correlation coefficient is between -1 and 1, with ±1 being a perfect linear relationship
- The square of the correlation coefficient is the fraction of the variability in one variable which is "explained by" the variability in the other variable

## Summary: Linear Regression
- Linear regression provides the best prediction of one variable from another variable, assuming they have a linear relationship
- Causal direction is built into the model
- Results give estimates for two parameters:
  - Intercept and slope, and confidence intervals for each
