Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 10 – Correlation and linear regression: Introduction.


1 Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 10 – Correlation and linear regression: Introduction to statistical models Marshall University Genomics Core Facility

2 Correlation
Correlation describes the tendency of one variable to vary in the same (or opposite) direction as another variable
Example (from Motulsky): Borkman et al. measured insulin sensitivity and the percentage of polyunsaturated fatty acids with 20 to 22 carbon atoms (%C20-22) in 13 healthy men
Both variables show a degree of variation:

3 Scatterplot
Scatterplot of insulin sensitivity against %C20-22:
The plot seems to show a relationship, or correlation, between the variables
The higher the %C20-22, the higher the insulin sensitivity

4 Correlation Coefficient
The correlation coefficient between two sets of values $x_1, \dots, x_n$ and $y_1, \dots, y_n$ is computed as follows:
– Calculate the standardized values of x and y, where sd is the sample standard deviation (computed with n-1):
  $z_{x,i} = (x_i - \operatorname{mean}(x)) / \operatorname{sd}(x)$; $z_{y,i} = (y_i - \operatorname{mean}(y)) / \operatorname{sd}(y)$
– Compute the products of all the standardized values, add them up, and divide by n-1:
  $r = (z_{x,1} z_{y,1} + z_{x,2} z_{y,2} + \dots + z_{x,n} z_{y,n}) / (n-1)$
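The recipe on this slide can be sketched in Python as follows. This is only an illustration: the function name `pearson_r` and the five data points are made up for the example and are not the insulin sensitivity data.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from standardized scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Standardized values of x and y
    zx = [(x - mean_x) / sd_x for x in xs]
    zy = [(y - mean_y) / sd_y for y in ys]
    # Sum of the products of standardized values, divided by n - 1
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Hypothetical illustrative values (NOT the data from the lecture)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

Swapping the two arguments gives the same value, which reflects the symmetry of the correlation coefficient.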

5 Why the correlation coefficient works
If a value is bigger than the mean, its standardized score is positive; if it is smaller than the mean, its standardized score is negative
The product of two standardized scores will be positive if both scores are positive, or both scores are negative
– i.e. if both values are bigger than their means, or both are less than their means
So if one variable tends to increase when the other increases, the bulk of the products of standardized scores will be positive, and the correlation coefficient will be positive (close to 1 for a strong relationship)
On the other hand, if one variable tends to decrease when the other increases, the bulk of the products of standardized scores will be negative, and the correlation coefficient will be negative (close to -1 for a strong relationship)
If there is no relationship, the signs of the products will be mixed, the products will tend to cancel out, and the correlation coefficient will be close to 0

6 Correlation coefficient for the insulin sensitivity data
The correlation coefficient for the insulin sensitivity data is r = 0.77
The square of this value is $r^2 = 0.59$
$r^2$ is always between 0 and 1
$r^2$ is easier to interpret than r: 59% of the variation in insulin sensitivity can be "explained" by the variation in %C20-22
– We will make this more precise later

7 Confidence Intervals for Correlation Coefficients
Most statistical software will compute a confidence interval for a correlation coefficient
– The 95% confidence interval for these data is [0.38, 0.93]
We are 95% confident that the interval from 0.38 to 0.93 includes the true correlation coefficient between insulin sensitivity and %C20-22 fatty acid content
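The slides do not say how the software computes this interval. One common approach is the Fisher z-transformation, sketched below under that assumption; the function name `fisher_ci` is made up for the example. With r = 0.77 and n = 13 it gives an interval that rounds to 0.38 to 0.93.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation coefficient
    using the Fisher z-transformation (an assumed method, see lead-in)."""
    if not (-1.0 < r < 1.0) or n < 4:
        raise ValueError("need -1 < r < 1 and n >= 4")
    z = math.atanh(r)                  # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    # Build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

low, high = fisher_ci(0.77, 13)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

The interval is not symmetric around r because the back-transformation compresses values near ±1.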

8 GRHL2 and Epithelial-Mesenchymal Transition
Epithelial-Mesenchymal Transition (EMT) is a process cancer cells must undergo before metastasis can occur
Mani et al. (Cell 2008; 133; 704-15) published a gene signature for cells which have undergone EMT
– Relative expression of a set of 251 genes indicative of EMT
Cieply et al. (Cancer Research, 2012) attempted to induce EMT in cells overexpressing GRHL2, and profiled the resulting gene expression by microarray
– Hypothesized that GRHL2 would suppress EMT
– Compared expression of Core EMT genes in their assay to that of Mani

9 Expression of Core EMT genes in GRHL2 overexpressed cells
Expression patterns show a strong negative correlation
Suggests that GRHL2 has suppressed EMT

10 p-values for correlation coefficients
It is possible to compute a p-value for a correlation coefficient
The null hypothesis is that there is no correlation
– i.e. that the true correlation coefficient is zero
So the p-value is the probability of getting a correlation coefficient at least as far from zero as the one observed, from a random sample of the same size as the one used, assuming there is no correlation in the population
Note that with large samples, even weak correlations can give very small p-values
– For the insulin sensitivity example (n=13), p = 0.0021
– For the GRHL2-EMT example (n=216), $p < 10^{-16}$
It is important to look at the r or $r^2$ value to determine if the result is of biological importance
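As a sketch of where such a p-value can come from: under the null hypothesis, the statistic t = r·sqrt((n-2)/(1-r²)) follows a Student's t distribution with n-2 degrees of freedom. The code below assumes a two-sided test and integrates the t density numerically with Simpson's rule so that it needs only the standard library; the function names are made up for the example. For r = 0.77 and n = 13 it should give t close to 4.0 and a p-value close to the 0.0021 quoted on the slide.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def correlation_p_value(r, n, steps=2000):
    """Return (t, two-sided p) for the null hypothesis of zero correlation."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    # Simpson's rule (steps must be even) for the area between 0 and |t|
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3   # area between -|t| and +|t|
    return t, 1 - central_area         # remaining area in the two tails

t_stat, p = correlation_p_value(0.77, 13)
print(f"t = {t_stat:.2f}, p = {p:.4f}")
```

In practice a statistics package would be used instead of hand-rolled integration; the point here is only to make the calculation visible.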

11 Correlation and Causality
A very common error is to assume that correlation implies causality
In the insulin sensitivity example, it would be wrong to conclude from the correlation alone that high lipid content caused high insulin sensitivity
The possible reasons for the correlation in this example are:
– Lipid content determines insulin sensitivity
– Insulin sensitivity determines lipid content
– Both lipid content and insulin sensitivity are determined by a common factor
– There is a complex network of interacting factors of which lipid content and insulin sensitivity are two components
– It is a coincidence
The p-value tells you how rare such a coincidence would be, under the null hypothesis
To distinguish among the other possibilities, further experimentation is needed

12 Correlation and Causality in the Examples
In the first example (insulin sensitivity), the investigators performed further experiments in which they manipulated the variables
– They concluded that lipid content determined insulin sensitivity (to some extent)
In the second example, the data come from the same genes under different sets of conditions
– There is no mechanism for the expression under one condition to affect the expression under another condition
– They are both determined by a common factor (the extent to which the cell has undergone EMT)
In the first example, it makes sense to investigate the nature of the influence of lipid content on insulin sensitivity further

13 Simple Linear Regression
Correlation asks the question "To what extent is there a linear relationship between two variables?"
Linear regression asks the question "What is the linear relationship between two variables?"
Correlation is symmetric
– The correlation coefficient between x and y is the same as the correlation coefficient between y and x
Linear regression is not symmetric:
– One variable must be designated as independent and one must be designated as dependent
– It assumes a model of causality
– Switching the roles of independent and dependent variables will produce different results

14 What does linear regression do?
Linear regression calculates the straight line that gives the best prediction of the y values from the x values
– It finds the values of a and b in the equation y = a + bx to do this
– This is done by minimizing the sum of the squares of the vertical distances from each point to the line
Note that:
– The roles of x and y are predetermined, and affect the result
– We can only estimate a and b based on our data sample
– We cannot know the true population values for a and b
– It is usually helpful to calculate a confidence interval for these
Does it make sense to perform linear regression on
– The insulin sensitivity data?
– The GRHL2-EMT expression data?
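A minimal sketch of the least-squares calculation, using the standard closed-form solution for a straight line. The function name `least_squares` and the data are hypothetical, chosen only to illustrate that swapping the roles of x and y gives a different line.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of
    squared vertical distances from the points to the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through the means
    return a, b

# Hypothetical illustrative values (NOT the data from the lecture)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.5, 5.0, 8.5, 9.0]

a_yx, b_yx = least_squares(x, y)   # y regressed on x
a_xy, b_xy = least_squares(y, x)   # x regressed on y (roles swapped)

print(f"y on x: y = {a_yx:.3f} + {b_yx:.3f} x")
# Drawn on the same axes, the x-on-y line has slope 1 / b_xy,
# which differs from b_yx unless the points lie exactly on a line.
print(f"x on y, drawn on the same axes: slope = {1 / b_xy:.3f}")
```

The two slopes differ because the second fit minimizes horizontal rather than vertical distances.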

15 Linear Regression for Insulin Sensitivity

16 Linear Regression results for Insulin Sensitivity

17 Interpreting linear regression results
The best fit values show the slope and intercepts of the line, along with their standard errors
The estimate of the slope is 37.21 with standard error 9.3
– For each increase of 1 percentage point in the percentage of polyunsaturated fatty acids with 20-22 carbon atoms, the insulin sensitivity increases on average by 37.21 mg/m²/min
The 95% confidence interval for the slope ranges from 16.75 to 57.67
– This is easier to interpret than the standard error
– We are 95% confident the range 16.75 to 57.67 includes the true value of the slope
The intercepts give the value of the insulin sensitivity when the %C20-22 is 0, and the value of %C20-22 that would yield an insulin sensitivity of 0
– Are these meaningful?
The $R^2$ value is 0.5929. This means that 59% of the variance in insulin sensitivity can be accounted for by the variation in C20-22 polyunsaturated fatty acids, and the remaining 41% is the result of other factors.
– We will discuss $R^2$ in more detail later
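The confidence interval for the slope can be approximately reconstructed from the slope and its standard error as slope ± t* × SE, where t* is the critical value of Student's t distribution with n-2 degrees of freedom. The sketch below assumes n = 13 (so 11 degrees of freedom) and takes t* = 2.201 from standard t tables; because the standard error on the slide is rounded to 9.3, the result agrees with the reported interval only to within about 0.01.

```python
slope = 37.21    # estimate reported on the slide
se = 9.3         # standard error reported on the slide (rounded)
n = 13           # number of subjects
df = n - 2       # degrees of freedom for simple linear regression

# Two-sided 95% critical value of Student's t for 11 degrees of freedom,
# taken from standard tables (an input to this sketch, not computed here)
t_crit = 2.201

low = slope - t_crit * se
high = slope + t_crit * se
print(f"95% CI for the slope: {low:.2f} to {high:.2f}")
```

Since the whole interval lies above zero, the data are consistent with a positive slope at this confidence level.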

18 p-value for linear regression
The linear regression results give a p-value of 0.0021
To interpret this, we need to know the null hypothesis
The null hypothesis is that there really is no linear relationship between insulin sensitivity and %C20-22
– If this were true, the best fit line for the population would have a slope of zero
If the null hypothesis were true, the chance of seeing a best fit line with a slope at least this steep (in either direction) in a sample of this size would be 0.21%
Note that the null hypothesis for correlation is essentially equivalent to the null hypothesis for linear regression
– Hence the p-values are equal
– However, the interpretations are different
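One way to see why the two p-values agree: in simple linear regression, the test statistic for the slope, slope/SE, is algebraically identical to the statistic used for the correlation coefficient, r·sqrt((n-2)/(1-r²)). The sketch below checks this with the rounded values quoted in the slides, so the two numbers match only approximately.

```python
import math

slope, se = 37.21, 9.3   # slope and (rounded) standard error from the slides
r, n = 0.77, 13          # correlation coefficient and sample size from the slides

t_from_slope = slope / se
t_from_r = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"t from slope: {t_from_slope:.3f}")
print(f"t from r:     {t_from_r:.3f}")
```

Both statistics are compared against the same t distribution with n-2 degrees of freedom, which is why the same p-value results.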

19 Assumptions for linear regression
Linear regression is based on the following assumptions:
– There is a linear relationship between the two quantities
– The residuals are normally distributed
  – The residuals are the vertical distances of each point from the line; the random scatter
– The variability is the same all the way along the line
– Data points are independent
– The x and y values are measured independently
– The x values are known precisely
Be careful of the following:
– Do not try to interpret the linear regression for values far from the data
– In our example, the %C20-22 values were all between 17 and 25. The linear regression is unlikely to be meaningful for values far from this. In particular, the intercept value (%C20-22=0) is likely to be meaningless.

20 Common mistakes with linear regression
Be careful of the following traps when using linear regression:
– Not all relationships are linear!
  – If the $R^2$ value for linear regression is low, consider the possibility there may be another relationship between the variables
– Don't use linear regression on smoothed data
  – This violates the assumption that data points are independent
– Don't use linear regression if y is (partly) calculated from x
  – For example, if y is the change in a measurement before and after treatment, and x is the value before treatment
  – This violates the assumption that x and y are measured independently
Always carefully consider which variable is x and which is y
– If you can't decide, you probably shouldn't be using regression
Always plot the data
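The warning about y being calculated from x can be illustrated with a small simulation. In this sketch the "before" and "after" measurements are generated independently, so the treatment has no effect at all, yet the change (after minus before) is strongly negatively correlated with the before value. The means, standard deviations, sample size, and seed are arbitrary assumptions chosen for the illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(617)  # fixed seed so the simulation is repeatable

# Simulated measurements: "after" is generated independently of "before"
before = [rng.gauss(100, 10) for _ in range(2000)]
after = [rng.gauss(100, 10) for _ in range(2000)]
change = [a - b for a, b in zip(after, before)]

r_before_after = pearson_r(before, after)    # expected to be near 0
r_before_change = pearson_r(before, change)  # expected to be near -0.71

print(f"before vs after:  r = {r_before_after:.2f}")
print(f"before vs change: r = {r_before_change:.2f}")
```

The negative correlation arises purely because "before" appears on both sides: with equal variances its theoretical value is -1/sqrt(2), about -0.71.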

21 Summary: Correlation
Correlation determines the extent to which two variables share a linear relationship
Makes no assumptions and draws no conclusions about causality
The correlation coefficient is between -1 and 1, with ±1 being a perfect linear relationship
The square of the correlation coefficient is the fraction of variability in one variable which is "explained by" the variability in the other variable

22 Summary: Linear Regression
Linear Regression provides the best prediction of one variable from another variable, assuming they have a linear relationship
Causal direction is built into the model
Results give estimates for two parameters:
– Intercept and slope, with confidence intervals for each

23 Introduction to Statistical Models

24 What is a model?
In general, a model is a (simpler) representation of something else
– We use models to study complex phenomena
– Easier to manipulate than the real thing of interest
– Easier to focus on specific aspects
– E.g. we use mouse models to study human disease
  – Easier to control behavior of the mouse
  – Easier to control genetics…

25 What is a mathematical model?
A mathematical model is an equation (or set of equations) that describes a physical state or process
– Describes how values in the state or process are related to each other
Aim is not to provide a perfect model
– A good model is simple enough to be easy to understand
– Yet complex enough to be useful

26 Statistical Models
Statistical models are mathematical models that model both the ideal predictions and the random “scatter” or “noise”
– Model both the population values and the “random” variation from the population values
– “Random” variation is really just variation not explained or accounted for by the model

27 Model terminology
A model is an equation (or set of equations)
The equation defines the outcome, or dependent variable, as a function of
– one or more independent variables, and
– one or more parameters
Each data point has its own values for the independent and dependent variables
The values of the parameters are properties of the population
– Do not vary from data point to data point

28 Fitting a model to data
The parameters are properties of the population
– They are unknown
Typically, we collect a sample of data points
Assuming the model is correct, we can use the sample to estimate the parameters of the model
– This is called “fitting a model to the data”
– Results in estimates and confidence intervals for each of the parameters

29 Simplest possible model
The simplest possible model for a data set involves no independent variable!
– Sample values from a population
– Assume the population values follow a Normal distribution
Our model is Y = μ + ε

30 Average as a model
In the simple model Y = μ + ε,
– Y is the dependent variable
  – Different value for each data point
– μ is a parameter
  – The mean of the population
  – Single, unknown value we will estimate from our data
– ε is the “random error”
  – Different for each data point, assumed normally distributed with mean zero
Can make the roles of the variable types more explicit by writing $Y_i = \mu + \varepsilon_i$

31 Why the mean is important
If we assume the model is correct:
– Our data are sampled from a population where the values are some fixed value, plus some scatter that is normally distributed with mean zero
then we want to use our data to estimate μ
It turns out that the value of μ that makes our observed data the most likely, out of all possible choices of μ, is the mean of our data
– The mean is the maximum likelihood estimate of μ
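The claim that the sample mean is the maximum likelihood estimate of μ can be checked numerically. The sketch below evaluates the Normal log-likelihood over a grid of candidate values of μ and picks the best one. The data values, the grid, and the fixed σ = 1 are assumptions made for the illustration; the maximizing μ does not depend on the value chosen for σ.

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of the model Y = mu + eps, with eps ~ Normal(0, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)
        for y in data
    )

# Hypothetical illustrative sample (NOT data from the lecture)
data = [4.1, 5.3, 4.8, 6.0, 5.2]
sample_mean = sum(data) / len(data)

# Try every candidate mu from 3.00 to 7.00 in steps of 0.01
candidates = [i / 100 for i in range(300, 701)]
best_mu = max(candidates, key=lambda mu: log_likelihood(mu, data))

print(f"sample mean = {sample_mean:.2f}, best mu on the grid = {best_mu:.2f}")
```

Maximizing this log-likelihood is the same as minimizing the sum of squared deviations from μ, which is the link to least squares.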

32 A more sophisticated model: linear regression
Revisit the example from linear regression:
– Measured insulin sensitivity and %C20-22 content in 13 healthy men
– Hypothesized that an increase in %C20-22 content caused an increase in insulin sensitivity
– Used linear regression to fit the model Y = intercept + slope × X + scatter to the data
  – Y is the insulin sensitivity, X the %C20-22 content
In more conventional notation: $Y = \beta_0 + \beta_1 X + \varepsilon$, or $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

33 Linear regression as a statistical model
The linear regression model has two parameters:
– $\beta_0$, the intercept
– $\beta_1$, the slope
These are both properties of the population
We use the data to estimate them
– Uses the method of “least squares”
– Gives the maximum likelihood estimate for the two parameters, when the scatter is normally distributed
  – The values of the parameters that make our observed data the most likely
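The connection between least squares and maximum likelihood can be made explicit with a short derivation. It assumes, as in the model on the preceding slides, that the errors $\varepsilon_i$ are independent and normally distributed with mean zero; the symbol $\sigma^2$ for their common variance is introduced here and does not appear in the slides.

```latex
% Log-likelihood of the data under Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,
% with independent \varepsilon_i \sim N(0, \sigma^2):
\log L(\beta_0, \beta_1, \sigma^2)
  = -\frac{n}{2}\log\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right)^2

% For any fixed \sigma^2, maximizing \log L over \beta_0 and \beta_1
% is the same as minimizing the sum of squares:
\left(\hat\beta_0, \hat\beta_1\right)
  = \arg\min_{\beta_0, \beta_1}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right)^2

% Setting the partial derivatives to zero gives the closed-form estimates:
\hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2},
\qquad
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X
```

Only the sum-of-squares term depends on $\beta_0$ and $\beta_1$, which is why the least-squares line and the maximum likelihood line coincide under this assumption.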

34 Recap of models
The linear regression in this example gave an estimate of the slope of 37.2, and an estimate of the intercept of -486.5
Our estimated model is Insulin sensitivity = 37.2 × %C20-22 - 486.5 + ε
The model is not assumed to be perfect!
Simple, but powerful enough to draw some basic conclusions
– Within the range of the data, an increase of one unit in %C20-22 corresponds, on average, to an increase of 37.2 units in insulin sensitivity
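The estimated model can be written as a small prediction function. This is a sketch only: the function name is made up, the range limits of 17 and 25 are the approximate range of %C20-22 values mentioned on slide 19, and refusing to predict outside that range is a design choice for the illustration rather than part of the model.

```python
SLOPE = 37.2         # estimated slope from the slides
INTERCEPT = -486.5   # estimated intercept from the slides

# Approximate range of %C20-22 observed in the sample (slide 19)
X_MIN, X_MAX = 17.0, 25.0

def predict_insulin_sensitivity(pct_c20_22):
    """Predicted average insulin sensitivity for a given %C20-22.

    Raises ValueError outside the observed range, where the fitted
    line is unlikely to be meaningful.
    """
    if not (X_MIN <= pct_c20_22 <= X_MAX):
        raise ValueError(
            f"%C20-22 = {pct_c20_22} is outside the observed range "
            f"{X_MIN} to {X_MAX}"
        )
    return INTERCEPT + SLOPE * pct_c20_22

at_20 = predict_insulin_sensitivity(20.0)
at_21 = predict_insulin_sensitivity(21.0)
print(f"prediction at 20%: {at_20:.1f}")
print(f"increase per unit of %C20-22: {at_21 - at_20:.1f}")
```

The prediction is the model's average value; individual observations scatter around it by the error term ε.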

35 Other types of model
We will look at other types of model in upcoming lectures:
– Multiple regression
  – More than one independent variable
– Logistic regression
  – Outcome variable is binary, one or more independent variables
– Proportional hazards regression
  – Outcome variable is survival time, one or more independent variables

