Presentation on theme: "Correlation and Linear Regression"— Presentation transcript:
1 Correlation and Linear Regression Microbiology 3053: Microbiological Procedures
2 Correlation Correlation analysis is used when you have measured two continuous variables and want to quantify how consistently they vary together. The stronger the correlation, the more accurately the value of one variable can be estimated from the other. The direction and magnitude of the correlation are quantified by Pearson's correlation coefficient, r, which ranges from perfectly negative (-1.00) to perfectly positive (1.00); 0.00 means no relationship.
3 Correlation The closer |r| is to 1, the stronger the relationship. r = 0 means that knowing the value of one variable tells us nothing about the value of the other. Correlation analysis uses data that has already been collected: archival data, not data produced by experimentation. Correlation does not show cause and effect but may suggest such a relationship.
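As an illustration of how Pearson's r quantifies direction and magnitude, here is a minimal Python sketch. The function name `pearson_r` and the example values are assumptions made for this illustration; they do not come from the presentation.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values chosen only to show the three reference cases
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises in step with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in step with x
r_none = pearson_r(x, [3, 1, 4, 1, 3])       # no linear trend in y

print(r_positive, r_negative, r_none)
```

The three calls correspond to the perfectly positive, perfectly negative, and no-relationship cases named on the slide.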
4 Correlation ≠ Causation There is a strong, positive correlation between: the number of churches and bars in a town; smoking and alcoholism (consider the relationship between smoking and lung cancer); students who eat breakfast and school performance; marijuana usage and heroin addiction (vs heroin addiction and marijuana usage).
5 Visualizing Correlation Scatterplots are used to illustrate correlation analysis. Assignment of axes does not matter (there are no independent and dependent variables). The order in which data pairs are plotted does not matter. In strict usage, lines are not drawn through correlation scatterplots.
7 Linear Regression Used to measure the relationship between two variables, for prediction and where a cause-and-effect relationship is assumed. Does one variable change in a consistent manner with another variable? x = independent variable (cause); y = dependent variable (effect). If it is not clear which variable is the cause and which is the effect, linear regression is probably an inappropriate test.
8 Linear Regression Calculated from experimental data. The independent variable is under the control of the investigator (exact value). The dependent variable is normally distributed. This differs from correlation, where both variables are normally distributed and selected at random by the investigator. Regression analysis with more than one independent variable is termed multiple (linear) regression.
9 Linear Regression The best-fit line is the one that minimizes the sum of the squares of the vertical distances of the data points from the predicted values (on the line).
10 Linear Regression y = a + bx, where a = y-intercept (the point where x = 0 and the line crosses the y-axis) and b = slope of the line, (y2 - y1)/(x2 - x1). The slope indicates the nature of the correlation. Positive: y increases as x increases. Negative: y decreases as x increases. 0: no correlation (same as Pearson's correlation; no relationship between the variables).
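The least-squares line y = a + bx can be computed directly from the data. The sketch below uses the standard closed-form solution; the names `fit_line` and `sum_sq_error` and the data points are hypothetical, chosen for illustration only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b

def sum_sq_error(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical measurements lying close to y = 1 + 2x
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = fit_line(x, y)
best = sum_sq_error(x, y, a, b)
# Nudging the slope or intercept away from the fit increases the error
worse_slope = sum_sq_error(x, y, a, b + 0.1)
worse_intercept = sum_sq_error(x, y, a + 0.1, b)

print(f"y = {a:.2f} + {b:.2f}x")
```

Comparing `best` with the two perturbed lines shows what "best fit" means here: any other slope or intercept gives a larger sum of squares.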
11 Correlation Coefficient (r) Shows the strength of the linear relationship between two variables. The closer the data points are to the line, the closer r is to 1 or -1. r varies from -1 (perfect negative correlation) to 1 (perfect positive correlation). Strength is graded from no or very weak association, through weak, moderate, and strong association, to very strong to perfect association. The null hypothesis is no association (r = 0). Salkind, N. J. (2000). Statistics for People Who Think They Hate Statistics. Thousand Oaks, CA: Sage.
12 Coefficient of Determination (r²) Used to estimate the extent to which the dependent variable (y) is under the influence of the independent variable (x). r² is the square of the correlation coefficient and varies from 0 to 1. r² = 1 means that the value of y is completely dependent on x (no error or other contributing factors). r² < 1 indicates that the value of y is influenced by more than the value of x.
13 Coefficient of Determination A measurement of the proportion of the variance of y explained by its dependence on x. The remainder (1 - r²) is the variance of y that is not explained by x (i.e., error or other factors). E.g., if r² = 0.84, it shows a strong relationship between the variables (the direction is given by the sign of r or of the slope, not by r²) and shows that x accounts for 84% of the variability of y (16% is due to other factors). r² can also be calculated for correlation analysis by squaring r, but there it is not a measure of the variation of y explained by variation in x; instead, variation in y is associated with variation in x (and vice versa).
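The "proportion of variance explained" reading can be checked numerically: for a simple least-squares line, 1 minus the ratio of residual to total sum of squares equals the square of Pearson's r. This is a minimal sketch with hypothetical data and helper names.

```python
import math

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def pearson_r(xs, ys):
    """Pearson's correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = fit_line(x, y)
mean_y = sum(y) / len(y)
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
ss_tot = sum((yi - mean_y) ** 2 for yi in y)                    # total

r2_from_fit = 1 - ss_res / ss_tot   # proportion of variance explained
r2_from_r = pearson_r(x, y) ** 2    # square of the correlation coefficient
unexplained = 1 - r2_from_fit       # share left to error or other factors

print(f"r^2 = {r2_from_fit:.4f}, unexplained = {unexplained:.4f}")
```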
14 Assumptions of Linear Regression The independent variable (x) is selected by the investigator (not random) and has no associated variance. For every value of x, values of y have a normal distribution. Observed values of y differ from the predicted (mean) value of y at that x by an amount called a residual; residuals are normally distributed. The variances of y for all values of x are equal (homoscedasticity). Observations are independent (each individual in the sample is measured only once).
15 Linear Regression Data The numbers alone do not guarantee that the data have been fitted well! Anscombe, F. J. Graphs in Statistical Analysis. The American Statistician 27(1):17-21.
17 Linear Regression Data Figure 1: Acceptable regression model with observations distributed evenly around the regression line. Figure 2: Strong curvature suggests that linear regression may not be appropriate (an additional variable may be required).
18 Linear Regression Data Figure 3: A single outlier alters the slope of the line. The point may be erroneous, but if not, a different test may be necessary. Figure 4: Actually a regression line connecting only two points. If the rightmost point were different, the regression line would shift.
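The four figures appear to correspond to the four data sets in the Anscombe paper cited on the earlier slide. As a rough sketch of why "the numbers alone" can mislead, the code below fits each set and shows that the fitted line and r come out nearly the same (about y = 3.00 + 0.50x, r about 0.82) even though the scatterplots look very different. The data values are Anscombe's; the helper names and the mapping of sets I to IV onto Figures 1 to 4 are assumptions.

```python
import math

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def pearson_r(xs, ys):
    """Pearson's correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet: sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Fit every set and collect (intercept, slope, r)
summary = {}
for name, (xs, ys) in quartet.items():
    a, b = fit_line(xs, ys)
    summary[name] = (a, b, pearson_r(xs, ys))
    print(f"Set {name}: y = {a:.2f} + {b:.2f}x, r = {summary[name][2]:.2f}")
```

Only a plot of each set reveals the curvature, the single outlier, and the lone high-leverage point that the summary statistics hide.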
19 What if we’re not sure if linear regression is appropriate?
20 Residuals Homoscedastic: variance appears random; good regression model. Heteroscedastic: "funnel" shaped and may be bowed; suggests that a transformation and inclusion of additional variables may be warranted. Helsel, D.R., and R.M. Hirsch. Statistical Methods in Water Resources. USGS.
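A residual plot is the usual way to judge this, but the idea can also be sketched numerically. The code below computes residuals from a least-squares fit and compares their spread in the upper and lower halves of the x range. This ratio is a crude, hypothetical heuristic for illustration, not a formal test, and both data sets are synthetic.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def residuals(xs, ys):
    """Observed y minus the value predicted by the fitted line."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def mean_square(values):
    return sum(v * v for v in values) / len(values)

def spread_ratio(xs, ys):
    """Mean squared residual at high x divided by that at low x."""
    pairs = sorted(zip(xs, residuals(xs, ys)))
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[half:]]
    return mean_square(upper) / mean_square(lower)

# Synthetic data: same underlying line, two different error patterns
x = list(range(1, 11))
signs = [1 if i % 2 == 0 else -1 for i in range(10)]
y_even = [2 + 3 * xi + s * 1.0 for xi, s in zip(x, signs)]          # constant scatter
y_funnel = [2 + 3 * xi + s * 0.2 * xi for xi, s in zip(x, signs)]   # scatter grows with x

ratio_even = spread_ratio(x, y_even)
ratio_funnel = spread_ratio(x, y_funnel)
print(f"even scatter: {ratio_even:.2f}, funnel: {ratio_funnel:.2f}")
```

A ratio near 1 is consistent with homoscedastic residuals, while a ratio well above 1 reflects the "funnel" shape described on the slide.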
22 Outliers Values that appear very different from others in the data set. Rule of thumb: an outlier is more than three standard deviations from the mean. Three causes: measurement or recording error; observation from a different population; a rare event from within the population. Outliers need to be considered and not simply dismissed. They may indicate an important phenomenon, e.g., ozone hole data (outliers were removed automatically by the analysis program, delaying the observation by about 10 years).
23 Outliers Helsel, D.R., and R.M. Hirsch. Statistical Methods in Water Resources. USGS.
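The three-standard-deviation rule of thumb can be sketched as follows. The function name `flag_outliers` and the readings are hypothetical. The function only reports candidate outliers and leaves the data untouched, in keeping with the point that outliers should be examined rather than dropped automatically.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

# Hypothetical readings: twenty values near 10 plus one extreme value
readings = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
            10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 50]

outliers = flag_outliers(readings)
print("Flagged for review:", outliers)
```

Note that the extreme value inflates the standard deviation itself, so with very small samples this rule may never flag anything.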
24 When is Linear Regression Appropriate? Data should be interval or ratio. The dependent and independent variables should be identifiable. The relationship between variables should be linear (if not, a transformation might be appropriate). Have you chosen the values of the independent variable? Does the residual plot show a random spread (homoscedastic), and does the normal probability plot display a straight line (or does a histogram of residuals show a normal distribution)?
25 (Normal Probability Plot of Residuals) The normal probability plot indicates whether the residuals follow a normal distribution, in which case the points will follow a straight line. Expect some moderate scatter even with normal data. Look only for definite patterns like an "S-shaped" curve, which indicates that a transformation of the response may provide a better analysis. (from Design Expert 7.0 from Stat-Ease)
27 Lineweaver-Burk Plot The Michaelis-Menten equation to describe enzyme activity, v0 = Vmax[S] / (Km + [S]), is linearized by taking its reciprocal: 1/v0 = (Km/Vmax)(1/[S]) + 1/Vmax. This has the form y = a + bx, where: y = 1/v0; x = 1/[S]; a = 1/Vmax; b = Km/Vmax.
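With y = 1/v0 and x = 1/[S], the intercept a = 1/Vmax and slope b = Km/Vmax from a straight-line fit give Vmax = 1/a and Km = b/a. The sketch below demonstrates this on noise-free synthetic rates generated from assumed values of Vmax and Km; all numbers and names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Assumed "true" kinetic constants, used only to generate example data
true_vmax = 10.0
true_km = 2.0
substrate = [0.5, 1.0, 2.0, 4.0, 8.0]  # [S], arbitrary units

# Initial rates from the Michaelis-Menten equation (no measurement noise)
rates = [true_vmax * s / (true_km + s) for s in substrate]

# Double-reciprocal transform: x = 1/[S], y = 1/v0
x = [1 / s for s in substrate]
y = [1 / v for v in rates]

a, b = fit_line(x, y)
vmax_est = 1 / a   # intercept a = 1/Vmax
km_est = b / a     # slope b = Km/Vmax, so Km = b * Vmax = b / a

print(f"Vmax = {vmax_est:.3f}, Km = {km_est:.3f}")
```

With real measurements the reciprocal transform magnifies error at low substrate concentrations, so the recovered constants would only approximate the true values.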