University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 18/08/2015 2:25 AM 1 Data analysis project Proposal must be approved.

Presentation transcript:

Slide 1: Data analysis project
The proposal must be approved. Submitting a draft before Dec. 2nd is strongly suggested. Describe the question, hypothesis, prediction, protocol and data (1-2 pages). Show the data (1-5 graphs). Report the analysis results, justify your choices and state the hypotheses tested (1-5 pages). Give the statistical and biological interpretation (1-2 pages). Total: less than 10 pages.

Slide 2: Grading scheme
Question, protocol, data, data copy (20%). Data examination, analysis, statistical interpretation (60%). Biological conclusion (10%). Style (10%).

Slide 3: Simple linear regression
What regression analysis does. The simple regression model. Hypothesis testing in regression. Residual analysis. Inverse prediction, replicated regression and weighted regression. Regression caveats. Power considerations in simple linear regression.

Slide 4: What regression does
Fits a straight line through a cloud of data. Tests and quantifies the effect of an independent variable X on a dependent variable Y. The intensity of the effect is given by the slope (b) of the regression. The importance of the effect is given by the coefficient of determination ($r^2$).
[Figure: Y plotted against X, with the slope shown as $b = \Delta Y / \Delta X$.]

Slide 5: Regression and correlation coefficients
The slope b is estimated as: $b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
The correlation r is: $r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
So, b = r if X and Y have the same variance, and b = 0 if and only if r = 0.
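The link between the two coefficients can be checked numerically. The sketch below is illustrative only: the function name `slope_and_correlation` and the data are made up, and it simply evaluates the usual sums of squares and cross-products to show that b equals r multiplied by the ratio of the standard deviations of Y and X.

```python
from math import sqrt

def slope_and_correlation(x, y):
    """Return (b, r, sd_ratio), where sd_ratio = s_Y / s_X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b = sxy / sxx                # regression slope of Y on X
    r = sxy / sqrt(sxx * syy)    # product-moment correlation
    return b, r, sqrt(syy / sxx)

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, r, sd_ratio = slope_and_correlation(x, y)
print(b, r * sd_ratio)  # the two values agree
```

When the two variables have equal variance, `sd_ratio` is 1 and the slope and the correlation coincide.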

Slide 6: How it does it
By the method of least squares, which minimizes the sum of squared deviations between the observations and the regression line, i.e. the sum of squared residuals.
Residual: $\varepsilon_i = Y_i - \hat{Y}_i$
Squared deviation of an observation: $\varepsilon_i^2 = (Y_i - \hat{Y}_i)^2$

Slide 7: Regression or correlation?
Correlation: degree of association between two variables X and Y; no causal relationship assumed! Regression: used to predict the value of the dependent variable if the independent variable were changed; causal relationship assumed!

Slide 8: When do we use regression?
Don't use it to determine the strength of association between two variables. Do use it if you want to predict the value of Y given X.
[Figure: regression of Y on X contrasted with correlation between $X_1$ and $X_2$.]

Slide 9: The simple regression model
The regression model is: $Y_i = \alpha + bX_i + \varepsilon_i$
So, all simple regression models are described by 2 parameters, the intercept ($\alpha$) and the slope (b).
[Figure: observed and expected values of $Y_i$ at $X_i$, showing the residual $\varepsilon_i$, the intercept $\alpha$ and the slope $b = \Delta Y / \Delta X$.]
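As a minimal sketch of this model, the code below fits the two parameters by least squares and computes the expected values and residuals. The helper name `fit_line` and the data are hypothetical; the intercept is obtained from the standard least-squares relation (mean of Y minus slope times mean of X).

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for Y = a + b*X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x   # the fitted line passes through the means
    return a, b

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(x, y)
expected = [a + b * xi for xi in x]                    # expected values
residuals = [yi - ei for yi, ei in zip(y, expected)]   # observed - expected
print(a, b, residuals)
```

A useful sanity check is that least-squares residuals sum to zero, up to rounding error.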

Slide 10: Assumptions
Residuals are independent and normally distributed. The variance of the residuals is equal for all X (homoscedasticity). The relationship between Y and X is linear. There is no measurement error on X (Model I regression).

Slide 11: Measurement error
The assumption of no error on X can be examined beforehand, and is almost invariably violated. It is only of concern when measurement error is large relative to the magnitude of X (say, > 10%). If the assumption is invalid, then Model II regression is required.

Slide 12: Residual analysis I: Independence
Plot residuals against estimates; look for patterns.
[Figure: residuals plotted against estimates.]

Slide 13: Residual analysis II: Normality
Plot residuals against estimates; look for patterns. Do a normal probability plot. Check with the Lilliefors test.
[Figure: residuals against estimates, and normal probability plots (NEDs against residuals) for normal and non-normal residuals.]

Slide 14: Residual analysis III: Homoscedasticity
Plot residuals against estimates; look for patterns. Check with Levene's test by grouping the Y's into several classes.
[Figure: residuals against estimates, with the estimates divided into Groups 1, 2 and 3.]

Slide 15: Residual analysis IV: Linearity
Plot residuals against estimates; look for patterns.
[Figure: Y against X, and residuals against estimates.]

Slide 16: Robustness of regression with respect to violation of assumptions

Slide 17: What to do when assumptions aren't met
Try transforming the data, but remember: (1) for some data, no transformation will work; (2) finding an appropriate transformation may not be easy. Use non-linear regression.

Slide 18: Transformations in regression

Slide 19: Transformations in regression
[Figure: chirp rate (chirps/min) as a function of temperature (°C) in males of the cricket Oecanthus fultoni, plotted on a linear scale and on a log scale.]

Slide 20: Transformations in regression
[Figure: electrical resistance as a function of illumination in cephalopod eyes; millivolts against relative brightness (times), plotted on a linear scale and on a log scale.]

Slide 21: Age and size of, guess what?

Slide 23: Regression output
*** Linear Model ***
Call: lm(formula = log10(FKLNGTH) ~ log10(AGE), data = Reg1dat, na.action = na.exclude)
Residuals: Min 1Q Median 3Q Max
Coefficients: Value Std. Error t value Pr(>|t|)
(Intercept)
log10(AGE)
Residual standard error: on 73 degrees of freedom
Multiple R-Squared:
F-statistic: on 1 and 73 degrees of freedom, the p-value is 0
5 observations deleted due to missing values

Slide 24: Hypothesis testing I: partitioning the total sums of squares
Total SS = Model (Explained) SS + Unexplained (Error) SS

Slide 25: Hypothesis testing I: partitioning the total sums of squares
So, MS regression = $s^2_Y$ and MS error = 0 if observed = expected. Calculate $F = MS_R / MS_e$ and compare with the F distribution with 1 and N - 2 df. $H_0$: F = 0
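The partition and the F ratio can be sketched as follows. The function name `partition_ss` and the data are made up for illustration; the p-value is not computed here and would be read from an F table with 1 and N - 2 df.

```python
def partition_ss(x, y):
    """Return (total SS, model SS, error SS, F) for a simple regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    fitted = [a + b * xi for xi in x]
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_model = sum((fi - mean_y) ** 2 for fi in fitted)          # explained
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
    ms_model = ss_model / 1        # 1 df for the slope
    ms_error = ss_error / (n - 2)  # N - 2 df for the residuals
    return ss_total, ss_model, ss_error, ms_model / ms_error

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_total, ss_model, ss_error, f_ratio = partition_ss(x, y)
print(ss_total, ss_model + ss_error, f_ratio)
```

The ratio of model SS to total SS is the coefficient of determination for the same fit.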

Slide 26: Standard error of the slope
The standard error $s_b$ and the $100(1-\alpha)\%$ CI of the slope are: $s_b = \sqrt{MS_e / \sum (X_i - \bar{X})^2}$ and $b \pm t_{\alpha(2),\,N-2}\, s_b$
So, for fixed N, we can decrease $s_b$ by expanding the range of X values sampled.
[Figure: two Y against X plots, one with a wide range of X ($s_b$ smaller) and one with a narrow range ($s_b$ larger).]

Slide 27: Standard error of the intercept
The standard error $s_\alpha$ of the intercept $\alpha$ is: $s_\alpha = \sqrt{MS_e \left( \frac{1}{N} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right)}$
So, for fixed N, we can decrease $s_\alpha$ by expanding the range of X values sampled.
[Figure: two Y against X plots, one with $s_\alpha$ smaller and one with $s_\alpha$ larger.]

Slide 28: Hypothesis testing II: testing model parameters
Test each hypothesis by a t-test: $t = \alpha / s_\alpha$ for $H_{01}: \alpha = 0$, and $t = b / s_b$ for $H_{02}: b = 0$, each with N - 2 df.
Note: these are 2-tailed hypotheses!
[Figure: observed and expected regression lines under $H_{01}: \alpha = 0$ and under $H_{02}: b = 0$.]
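A minimal sketch of these tests is given below, assuming the usual least-squares standard errors based on the error mean square. The function name `parameter_tests` and the data are hypothetical, and the resulting t values would be compared with a t table at N - 2 df, since the standard library has no t distribution.

```python
from math import sqrt

def parameter_tests(x, y):
    """Estimates, standard errors and t statistics for intercept and slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    ms_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    s_b = sqrt(ms_error / sxx)                          # SE of the slope
    s_a = sqrt(ms_error * (1 / n + mean_x ** 2 / sxx))  # SE of the intercept
    return {"a": a, "b": b, "s_a": s_a, "s_b": s_b,
            "t_a": a / s_a, "t_b": b / s_b}

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
res = parameter_tests(x, y)
print(res)
```

With these made-up data the slope has a large t value while the intercept does not, so only the slope would be judged different from zero.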

Slide 29: Hypothesis testing III: one-tailed hypotheses
Biological theory predicts that Y should increase with X. So, $H_0: b \le 0$ (one-tailed). Calculate: $t_b = b / s_b$. Reject $H_0$ if $t_b > 0$ and p (one-tailed) < $\alpha$.
[Figure: Y against X plots for a case where $H_0$ is accepted and a case where $H_0$ is rejected.]

Slide 30: Confidence intervals in regression
$100(1-\alpha)\%$ CI for estimated values: $\hat{Y} \pm t_{\alpha(2),\,N-2} \sqrt{MS_e \left( \frac{1}{N} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)}$
$100(1-\alpha)\%$ CI for observations: $\hat{Y} \pm t_{\alpha(2),\,N-2} \sqrt{MS_e \left( 1 + \frac{1}{N} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)}$

Slide 31: Confidence intervals in regression
The CI for observations is larger than the CI for estimated values. The CIs for both estimated values and observations widen with increasing distance between the X value and the sample mean.
[Figure: Y against X with confidence bands for estimates and for observations.]

Slide 32: Outliers
Outliers are points that appear to lie well off the fitted line. Issue 1: are "apparent" outliers really outliers? Issue 2: do they significantly affect the statistical conclusions?
[Figure: Y against X with one candidate outlier marked.]

Slide 33: Outlier analysis I: Studentized residuals
Plot Studentized residuals against estimated values. "Large" residuals are those with value > 3.0. Such cases make large contributions to the residual mean square of the regression.

Slide 34: Outlier analysis II: Leverage
Leverage measures the potential influence of a case on the regression line. It is determined by the X value only, so that points far from the mean have higher leverage. "Large" = anything greater than 4/N.
[Figure: Y against X with a small-leverage point and a large-leverage point.]
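For a simple regression, leverage has a closed form that depends only on X. The sketch below assumes the standard expression (1/N plus the squared deviation of X from its mean, divided by the sum of squared deviations) and flags cases above the 4/N rule of thumb; the function name `leverages` and the data are made up.

```python
def leverages(x):
    """Leverage of each case in a simple regression; depends on X only."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Made-up X values with one point far from the mean
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = leverages(x)
threshold = 4 / len(x)   # the "large leverage" rule of thumb
flagged = [i for i, hi in enumerate(h) if hi > threshold]
print(h, flagged)
```

The leverages of a simple regression sum to 2, the number of fitted parameters, which makes a convenient check on the calculation.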

Slide 35: Outlier analysis III: Cook's distance
Cook's distance measures both leverage and contribution to the residual mean square, i.e. the actual influence of a point. "Large" = anything greater than 1.
[Figure: Y against X with points of smaller and larger Cook's distance.]

Slide 36: Resolving outlier problems
Do the outliers have a significant effect on the regression results? To determine this, delete them, rerun the analyses and compare the results. Are the slope and intercept estimates significantly affected, i.e. do they no longer lie within the 95% CIs of the original estimates?
[Figure: regression lines with outliers in and outliers out, for a case with no significant effect and a case with a significant effect.]

Slide 37: The effects of outlier deletion
Deletion reduces the sample size (N), thereby reducing power. It also decreases $MS_e$, so $s_b$ decreases and power increases. If N is small, the former effect will probably outweigh the latter unless the outliers are very aberrant.
[Figure: power ($1 - \beta$) curves comparing smaller and larger N with $s_b$ fixed, and larger and smaller $s_b$ with N fixed.]

Slide 38: Inverse prediction
We have a regression of Y on X, but want to predict X given Y. A regression of X on Y is not possible due to the error in Y. Example: calibration curves, where we want to predict concentration from a reading, based on a regression of readings on known solute concentrations.
[Figure: reading against concentration, and concentration against reading showing the error in "X".]

Slide 39: Inverse prediction
Regress Y on X. Generate the predicted value of X given Y. Calculate 95% confidence limits for the "X" estimate based on the 95% confidence limits for the "Y" estimate from the standard regression.
[Figure: predicted "X" for a given Y, with lower and upper 95% limits.]

Slide 40: Regression with replication
Used when several Y's are measured for each X. In this case, we can test the linearity assumption directly by testing the MS due to deviations from linearity over the MS within groups.
[Figure: partition of the sums of squares into regression SS and SS due to nonlinearity (together the group SS) and within-group (error) SS.]

Slide 41: Weighted regression
Used when our confidence in the values of individual observations varies, e.g. because of different measurement error or precision. In replicated designs, the variance of Y for a given X may vary among X's, as may the sample size (N). So, weight by N or by the inverse of the sample variance.
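One common way to apply such weights is weighted least squares, in which the means and the sums of squares and cross-products are all weighted. The sketch below takes that approach as an assumption; the function name `weighted_fit`, the data and the variances are all made up.

```python
def weighted_fit(x, y, w):
    """Weighted least-squares intercept and slope with case weights w."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)))
    return mean_y - b * mean_x, b

# Made-up data; the last observation is assumed to be much less precise
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
variances = [0.1, 0.1, 0.1, 0.1, 2.0]
weights = [1 / v for v in variances]   # weight by inverse variance

a_w, b_w = weighted_fit(x, y, weights)
a_eq, b_eq = weighted_fit(x, y, [1.0] * len(x))   # equal weights
print((a_w, b_w), (a_eq, b_eq))
```

With equal weights the function reduces to the ordinary least-squares fit, so the weighting only matters when precision differs among observations.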

Slide 42: Regression caveats I: causation
A statistically significant regression of Y on X need not imply a causal relationship between the two. A non-significant linear regression need not imply the lack of a causal relationship if the causal relationship is non-linear.
[Figure: X and Y both depending on a third variable Z, and a non-linear relationship between Y and X for which the linear $H_0$ is accepted.]

Slide 43: Regression caveats II: small samples
Significant regressions can be obtained by chance, i.e. even when no (linear) causal relationship exists. This is especially true if sample sizes are small. So when doing multiple simple regressions, control the experiment-wise error rate $\alpha_e$.
[Figure: true regression ($H_0$ accepted) and sample regression ($H_0$ rejected).]

Slide 44: Regression caveats III: large samples
When N is large, only very small regression coefficients are required to reject $H_0$ (power is large). So, be careful of "overinterpreting" the observed relationship if $R^2$ is small.
[Figure: true regression with $H_0$ rejected but $R^2$ small.]

Slide 45: Regression caveats IV: extrapolation and interpolation
Be careful when (1) predictions lie outside the range of the sample; (2) predictions are for values where data are sparse.
[Figure: estimated and true relations between Y and X, with the observations, a predicted value and the true value.]

Slide 46: The final word on extrapolation
In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-six miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oölitic Silurian period, just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck over the Gulf of Mexico like a fishing rod. And by the same token, any person can see that seven hundred and forty-two years from now, the lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen.
Mark Twain, Life on the Mississippi

Slide 47: Power and sample size in simple linear regression
Because the correlation coefficient r and the regression coefficient b are closely related, we can transform b to r and evaluate power using r.

Slide 48: Power and sample size in regression
If we test $H_0: b = 0$ with sample size n, we can determine $1 - \beta$ by calculating the z-transformed values for the critical value of the corresponding r at the specified $\alpha$ ($z_\alpha$) and for the sample regression coefficient b ($z_r$), and then finding the one-tailed probability of the resulting normal deviate $Z_{\beta(1)}$.

Slide 49: Power and sample size in regression
Once $Z_{\beta(1)}$ is determined, we can calculate the probability of obtaining a Z-value of this size or greater, i.e. $\beta$. Power is then $1 - \beta$.
[Figure: normal curve with the one-tailed probability $\beta$ beyond $Z_{\beta(1)}$.]

Slide 50: Power and sample size in regression: an example
Changes in wing length with age in a sample of 13 birds. So $1 - \beta$ = 1.00.

Slide 51: Minimal sample size in regression
Given a desired power $1 - \beta$, how large a sample is required to reject $H_0: b = 0$ if it is false and the true regression coefficient is at least $b_0$? To do so, first calculate the correlation coefficient $\rho_0$ corresponding to $b_0$.
[Figure: observed regression, regression expected under $H_0: b = 0$, and true regression ($b_0$).]

Slide 52: Minimal sample size in regression (cont'd)
…then calculate the minimal sample size from $\rho_0$.

Slide 53: Minimal sample size: an example
We want to reject $H_0: b = 0$ 99% of the time when $b_0 > 0.2$ and $\alpha(2)$ = .05. So $\beta(1)$ = .01 and, for b = .20, we have...

Slide 54: Minimal sample size (cont'd)
So, a sample size of at least 8 should be used.

Slide 55: Correlation
The underlying principle of correlation analysis. Measuring the strength of a correlation. Assumptions. Confidence intervals and hypothesis testing. Comparing correlations. Non-parametric correlations. Power in correlation analysis.

Slide 56: The underlying principle of correlation analysis
Correlation measures the extent to which two variables covary, in particular the strength of the linear association between them. No causal relationship is implied, so there is no distinction between dependent and independent variables.

Slide 57: When do we use correlation?
Do use it to determine the strength of association between two variables. Do not use it if you want to predict the value of X given Y, or vice versa.
[Figure: correlation between $X_1$ and $X_2$ contrasted with regression of Y on X.]

Slide 58: Simple linear correlation versus simple linear regression
The calculations are the same. In correlation analysis, one must sample both X and Y randomly. Correlation deals with association (importance). Regression deals with prediction (intensity).

Slide 59: Lab example: fork length and round weight of sturgeon
Since the two variables are not causally related, use correlation to measure the strength of association.

Slide 60: Regression: fork length and age of sturgeon
The two variables are causally related. The relationship between the two provides an estimate of growth rates, and we can use the relationship to predict the size of sturgeon of a given age.

Slide 61: Measuring the strength of a correlation
The test statistic is the product-moment correlation coefficient r.

Slide 62: Measuring the strength of a correlation
r always lies between -1 and 1. $r^2$ is the coefficient of determination, which measures the proportion of the variance in $X_1$ (or $X_2$) "explained" by variation in $X_2$ (or $X_1$).
[Figure: scatterplots of $X_2$ against $X_1$ for r = 0.9, r = 0.5, r = 0, r = -0.5 and r = -0.9.]

Slide 63: Assumptions of correlation analysis I: Bivariate normality
For each value of $X_1$, the $X_2$ values are normally distributed, and vice versa.
[Figure: bivariate normal distributions for r = 0.8 and r = 0.]

Slide 64: Assumptions of correlation analysis II: Homoscedasticity
The variance of $X_1$ is independent of the value of $X_2$, and vice versa. But the variances of $X_1$ and $X_2$ need not be equal.
[Figure: homoscedastic and heteroscedastic scatterplots of $X_2$ against $X_1$.]

Slide 65: Assumptions of correlation analysis III: Linearity
The relationship between $X_1$ and $X_2$ is linear.
[Figure: linear and nonlinear scatterplots of $X_2$ against $X_1$.]

Slide 66: Violation of assumptions: fork length and age of sturgeon
The relationship between fork length and age appears non-linear. The variance in fork length appears to increase with age.

Slide 67: If parametric correlation assumptions aren't met...
Try transforming the data (e.g. log transform). Try a non-parametric correlation analysis.

Slide 68: Confidence intervals for correlation coefficients
The $100(1-\alpha)\%$ confidence limits for the Z-transformed correlation are given by: $z \pm Z_{\alpha(2)} \sqrt{\frac{1}{N - 3}}$, where $z = \frac{1}{2} \ln \frac{1 + r}{1 - r}$
Convert back to the untransformed CI by: $r = \frac{e^{2z} - 1}{e^{2z} + 1}$
[Figure: scatterplots of $X_2$ against $X_1$ giving a smaller CI and a larger CI.]
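The interval can be sketched with the standard library, using the fact that Fisher's z-transformation is the inverse hyperbolic tangent, so that the back-transformation is the hyperbolic tangent. The function name `correlation_ci` and the inputs (r = 0.5, N = 28) are made up for illustration.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def correlation_ci(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% CI for a correlation via Fisher's z."""
    z = atanh(r)                        # z = 0.5 * ln((1 + r) / (1 - r))
    se = sqrt(1 / (n - 3))              # standard error of z
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed normal deviate
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# Made-up inputs, for illustration only
lower, upper = correlation_ci(0.5, 28)
print(lower, upper)
```

The interval is asymmetric around r because the transformation is nonlinear, and it narrows as N increases.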

Slide 69: Hypothesis testing I
$H_0: \rho = 0$. The standard error of the correlation coefficient is given by: $s_r = \sqrt{\frac{1 - r^2}{N - 2}}$. Calculate $t = r / s_r$ and compare to the t-distribution with N - 2 df.
[Figure: observed and expected scatterplots of $X_2$ against $X_1$ for cases where $H_0$ is rejected and accepted.]

Slide 70: Hypothesis testing II
$H_0: \rho = \rho_0$. Transform r and $\rho_0$ to $z = \frac{1}{2} \ln \frac{1 + r}{1 - r}$ and $\zeta_0 = \frac{1}{2} \ln \frac{1 + \rho_0}{1 - \rho_0}$. Calculate $Z = (z - \zeta_0) \sqrt{N - 3}$ and compare to the Z distribution (the standard error is based on N - 3).
[Figure: observed and expected scatterplots of $X_2$ against $X_1$ for cases where $H_0$ is rejected and accepted.]

Slide 71: Comparing 2 correlations
$H_0: \rho_1 = \rho_2$. Transform $r_1$ and $r_2$ to: $z_i = \frac{1}{2} \ln \frac{1 + r_i}{1 - r_i}$. Calculate $Z = \frac{z_1 - z_2}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}}$ and compare to the Z distribution.
[Figure: pairs of scatterplots with correlations $r_1$ and $r_2$ for cases where $H_0$ is rejected and accepted.]

Slide 72: Comparing multiple correlations
$H_0: \rho_i = \rho_j = \rho_k = \dots$, based on $n_i, n_j, n_k, \dots$ observations. Z-transform all $r_i$ to $z_i$ and calculate $\chi^2 = \sum (n_i - 3) z_i^2 - \frac{\left[ \sum (n_i - 3) z_i \right]^2}{\sum (n_i - 3)}$, and compare to the $\chi^2$ distribution with df = k - 1.

Slide 73: Computing common correlations
If $H_0: \rho_i = \rho_j = \rho_k = \dots$ is accepted, then each $r_i$ estimates the same (population) correlation $\rho$. To calculate $\rho$, first calculate the weighted Z-score $z_w = \frac{\sum (n_i - 3) z_i}{\sum (n_i - 3)}$, then back-transform to get $\rho$.
[Figure: scatterplots with correlations $r_1$, $r_2$ and $r_3$ for a case where $H_0$ is accepted.]
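Both steps, the homogeneity statistic and the common correlation, can be sketched together. The code assumes the usual weighting of each z by n - 3; the function name `compare_correlations` and the inputs are hypothetical, and the chi-square value would be compared with a table at k - 1 df.

```python
from math import atanh, tanh

def compare_correlations(rs, ns):
    """Return (chi-square, df, common r) for k correlation coefficients."""
    zs = [atanh(r) for r in rs]     # Fisher z-transform of each r
    ws = [n - 3 for n in ns]        # weights n_i - 3
    sum_w = sum(ws)
    sum_wz = sum(w * z for w, z in zip(ws, zs))
    chi2 = sum(w * z * z for w, z in zip(ws, zs)) - sum_wz ** 2 / sum_w
    z_w = sum_wz / sum_w            # weighted mean z
    return chi2, len(rs) - 1, tanh(z_w)

# Made-up correlations and sample sizes, for illustration only
chi2, df, r_common = compare_correlations([0.5, 0.6, 0.7], [30, 40, 50])
chi2_same, _, r_same = compare_correlations([0.5, 0.5, 0.5], [30, 40, 50])
print(chi2, df, r_common)
```

When all coefficients are identical, the statistic is zero and the common correlation equals the shared value, which is a quick check on the arithmetic.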

Slide 74: Non-parametric correlations
Use when one or more assumptions are not met. Essentially a parametric correlation of the ranks. The most common statistic is the Spearman rank correlation.
[Figure: $X_2$ against $X_1$, and rank of $X_2$ against rank of $X_1$.]
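The "correlation of the ranks" idea can be shown directly: rank each variable, then apply the product-moment formula to the ranks. This is a minimal sketch with hypothetical helper names (`average_ranks`, `pearson`, `spearman`) and made-up data; tied values receive the average of their ranks.

```python
from math import sqrt

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                    # extend the run of tied values
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman rank correlation = product-moment correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up monotonic but non-linear data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
print(pearson(x, y), spearman(x, y))
```

For a relationship that is monotonic but not linear, the rank correlation reaches 1 while the product-moment correlation does not.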

Slide 75: Power and sample size in correlation
If we test $H_0: \rho = 0$ with sample size n, we can determine $1 - \beta$ by using the Z-transformation for critical values (for given $\alpha$) of the true correlation $\rho$ ($z_\alpha$) and the sample correlation r ($z_r$).
[Figure: scatterplot of $X_2$ against $X_1$, and normal curve with the probability $\beta$.]

Slide 76: Power and sample size in correlation
Once $Z_{\beta(1)}$ is determined, we can calculate the probability of obtaining a Z-value of this size or greater, i.e. $\beta$. Power is then $1 - \beta$.
[Figure: scatterplot of $X_2$ against $X_1$, and normal curve with the probability $\beta$.]
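A sketch of this calculation follows, assuming the normal deviate takes the common textbook form $Z_{\beta(1)} = (z_r - z_\alpha)\sqrt{n - 3}$; that form, the function name `power_for_correlation` and the inputs are assumptions and are not taken from the slides. The critical r is passed in as a table value because the standard library has no t distribution.

```python
from math import atanh, sqrt
from statistics import NormalDist

def power_for_correlation(r, n, r_crit):
    """Approximate power of the test of H0: rho = 0.

    r_crit is the critical value of r at the chosen alpha and n - 2 df,
    looked up in a table (an input here, not computed).
    """
    z_beta = (atanh(abs(r)) - atanh(r_crit)) * sqrt(n - 3)
    beta = 1 - NormalDist().cdf(z_beta)   # P(Z >= z_beta), one-tailed
    return 1 - beta

# Illustrative inputs only: r = 0.870, n = 12, tabled critical r = 0.576
power = power_for_correlation(0.870, 12, 0.576)
print(round(power, 2))
```

Because the regression and correlation tests are linked through r, the same function could be applied to the r corresponding to a regression slope.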

Slide 77: Power and sample size in correlation: an example
Correlation of wing length and tail length in a sample of 12 birds. So $1 - \beta$ = 0.98.

Slide 78: Minimal sample size
Given a desired power $1 - \beta$, how large a sample is required to reject $H_0: \rho = 0$ if it is false with a specified $\rho_0$? Calculate: $n = \left( \frac{Z_{\beta(1)} + Z_{\alpha}}{\zeta_0} \right)^2 + 3$, where $\zeta_0$ is the Z-transform of $\rho_0$.
[Figure: observed and expected scatterplots of $X_2$ against $X_1$.]

Slide 79: Minimal sample size: an example
We want to reject $H_0: \rho = 0$ 99% of the time when $|\rho| > 0.5$ and $\alpha(2)$ = .05. So $\beta(1)$ = .01 and, for r = .50, we have $\zeta_0 = 0.5493$. Hence $n = \left( \frac{2.3263 + 1.9600}{0.5493} \right)^2 + 3 \approx 63.9$. So, a sample size of at least 64 should be used.
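The sample-size calculation above can be reproduced with the standard library. The function name `minimal_n` is hypothetical, and the formula is the one assumed from the slide text: the sum of the two normal deviates divided by the Z-transformed correlation, squared, plus 3.

```python
from math import atanh, ceil
from statistics import NormalDist

def minimal_n(rho0, power, alpha_two_tailed):
    """Smallest n giving the desired power to reject H0: rho = 0."""
    z_beta = NormalDist().inv_cdf(power)                       # Z_beta(1)
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_tailed / 2)   # Z_alpha(2)
    zeta0 = atanh(rho0)                                        # Fisher z of rho0
    return ceil(((z_beta + z_alpha) / zeta0) ** 2 + 3)

n_required = minimal_n(0.5, 0.99, 0.05)
print(n_required)  # 64
```

Rounding up is deliberate: a fractional result means the next whole sample size is needed to reach the stated power.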

Slide 80: Power and sample size in comparing 2 correlations
The power of a test for a difference between two correlation coefficients is $1 - \beta$, where $\beta$ is the one-tailed probability of: $Z_{\beta(1)} = \frac{|z_1 - z_2|}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}} - Z_{\alpha}$
[Figure: pairs of scatterplots with correlations $r_1$ and $r_2$ for cases where $H_0$ is rejected and accepted.]

Slide 81: An example
What is the power to detect a difference? Using the table of normal deviates, power = 0.22.
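A sketch of the two-correlation test and its power is given below, assuming the standard form in which the critical deviate is subtracted from the absolute test statistic. The function name `two_correlation_power` and the inputs are illustrative and are not taken from the slides, whose example values were not preserved.

```python
from math import atanh, sqrt
from statistics import NormalDist

def two_correlation_power(r1, n1, r2, n2, alpha_two_tailed=0.05):
    """Return (Z statistic, approximate power) for H0: rho1 = rho2."""
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # SE of z1 - z2
    z_stat = (atanh(r1) - atanh(r2)) / se
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_tailed / 2)
    z_beta = abs(z_stat) - z_alpha
    beta = 1 - NormalDist().cdf(z_beta)      # P(Z >= z_beta), one-tailed
    return z_stat, 1 - beta

# Illustrative inputs only
z_stat, power = two_correlation_power(0.78, 98, 0.84, 95)
print(round(z_stat, 2), round(power, 2))
```

Low power in a case like this means that a non-significant difference between the two coefficients says little about whether the population correlations really are equal.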