Statistics for clinicians Biostatistics course by Kevin E. Kip, Ph.D., FAHA Professor and Executive Director, Research Center University of South Florida, College of Nursing Professor, College of Public Health Department of Epidemiology and Biostatistics Associate Member, Byrd Alzheimer’s Institute Morsani College of Medicine Tampa, FL, USA
SECTION 5.1 Parameters and factors that affect sample size Sample size estimation and correlation
SECTION 5.6 Sample size estimates for a two sample (independent groups) dichotomous outcome
Learning Outcome: Calculate and interpret sample size estimates for a two sample (independent groups) dichotomous outcome
---Estimate for a confidence interval
---Estimate for a hypothesis test
Dichotomous Outcome – Two Independent Samples
Sample Size to Estimate C.I. for (p1 – p2):
$n_i = [p_1(1-p_1) + p_2(1-p_2)] \left(\frac{Z}{E}\right)^2$
Sample Size for Hypothesis Test, H0: p1 = p2:
$n_i = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{ES}\right)^2$, where $ES = \frac{|p_1 - p_2|}{\sqrt{p(1-p)}}$ and p is the mean of p1 and p2
Dichotomous Outcome – Two Independent Samples (C.I.)
Sample Size to Estimate C.I. for (p1 – p2): $n_i = [p_1(1-p_1) + p_2(1-p_2)] \left(\frac{Z}{E}\right)^2$
Example: Estimate required sample size for 95% C.I. for the difference in the incidence proportion of adults over 50 who develop prostate cancer (over 30 years) by smoking status (non-smokers vs. heavy smokers).
Parameters:
Margin of error (E): 5%
Assumed prevalence (Non-smoker): p1 = 0.17
Assumed prevalence (Smoker): p2 = 0.34
Assumed dropout rate: 20%
Desired C.I.: 95% (i.e. z = 1.96)
$n_i = [0.17(1-0.17) + 0.34(1-0.34)] \left(\frac{1.96}{0.05}\right)^2 = 561.6$
n1 = n2 = 561.6, n = 1123
Take into account the dropout rate: N (number to enroll) = n / (% retained)
N = 1123 / 0.80 = 1404 subjects
Dichotomous Outcome – Two Independent Samples (C.I.) (Practice)
Sample Size to Estimate C.I. for (p1 – p2): $n_i = [p_1(1-p_1) + p_2(1-p_2)] \left(\frac{Z}{E}\right)^2$
Example: Estimate required sample size for 95% C.I. for the difference in the annual incidence proportion of depression among teenagers by psychological trauma (trauma vs. no trauma).
Parameters:
Margin of error (E): 5%
Assumed prevalence (No trauma): p1 = 0.06
Assumed prevalence (Trauma): p2 = 0.12
Assumed dropout rate: 10%
Desired C.I.: 95% (i.e. z = 1.96)
n_i = [ ]
n1 = _____ n2 = _____ n = _____
Take into account the dropout rate: N (number to enroll) = n / (% retained)
N = ________________________
Dichotomous Outcome – Two Independent Samples (C.I.) (Practice, worked answer)
Sample Size to Estimate C.I. for (p1 – p2): $n_i = [p_1(1-p_1) + p_2(1-p_2)] \left(\frac{Z}{E}\right)^2$
Example: Estimate required sample size for 95% C.I. for the difference in the annual incidence proportion of depression among teenagers by psychological trauma (trauma vs. no trauma).
Parameters:
Margin of error (E): 5%
Assumed prevalence (No trauma): p1 = 0.06
Assumed prevalence (Trauma): p2 = 0.12
Assumed dropout rate: 10%
Desired C.I.: 95% (i.e. z = 1.96)
$n_i = [0.06(1-0.06) + 0.12(1-0.12)] \left(\frac{1.96}{0.05}\right)^2 = 248.9$
n1 = n2 = 248.9, n = 498
Take into account the dropout rate: N (number to enroll) = n / (% retained)
N = 498 / 0.90 = 553.3, rounded up to 554 subjects
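The confidence-interval calculation above can be sketched in Python. This is a minimal sketch and not part of the original slides: the function names `n_per_group_ci` and `n_to_enroll` are hypothetical, and the rounding convention (total rounded to a whole subject, enrollment rounded up) is an assumption chosen because it reproduces the 1404 subjects in the prostate cancer example.

```python
import math

def n_per_group_ci(p1, p2, margin, z=1.96):
    """Per-group sample size so the C.I. for (p1 - p2) has the given margin of error."""
    return (p1 * (1 - p1) + p2 * (1 - p2)) * (z / margin) ** 2

def n_to_enroll(n_total, dropout):
    """N = n / (% retained); n rounded to a whole subject, N rounded up (assumed convention)."""
    return math.ceil(round(n_total) / (1 - dropout))

# Prostate cancer example: p1 = 0.17, p2 = 0.34, E = 0.05, 20% dropout
n_cancer = n_per_group_ci(0.17, 0.34, 0.05)
print(round(n_cancer, 1))                 # 561.6 per group
print(n_to_enroll(2 * n_cancer, 0.20))    # 1404 subjects to enroll

# Depression practice: p1 = 0.06, p2 = 0.12, E = 0.05, 10% dropout
n_depression = n_per_group_ci(0.06, 0.12, 0.05)
print(round(n_depression, 1))                 # 248.9 per group
print(n_to_enroll(2 * n_depression, 0.10))    # 554 subjects to enroll
```

Note how the required sample size grows with the square of Z/E: halving the margin of error roughly quadruples the number of subjects.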
Dichotomous Outcome – Two Independent Samples (H0 Test)
Sample Size for Hypothesis Test, H0: p1 = p2: $n_i = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{ES}\right)^2$, where $ES = \frac{|p_1 - p_2|}{\sqrt{p(1-p)}}$
Example: Compare the prevalence of hypertension in a trial of a new drug versus placebo.
Parameters/Assumptions:
Effect to detect: 20% relative reduction
Assumed prevalence (Placebo): p1 = 0.30
Assumed prevalence (Drug): p2 = 0.24
Assumed dropout rate: 10%
2-sided type I error rate (α): 0.05 (Z = 1.96)
Desired power (1-β): 0.80 (Z = 0.84)
p = (0.30 + 0.24) / 2 = 0.27
$ES = \frac{|0.30 - 0.24|}{\sqrt{0.27(1-0.27)}} = 0.13515$
$n_i = 2\left(\frac{1.96 + 0.84}{0.13515}\right)^2 = 858.5$
n1 = n2 = 858.5, n = 1717
Take into account the dropout rate: N (number to enroll) = n / (% retained)
N = 1717 / 0.90 = 1908 subjects
Enrolling N = 1908 subjects will ensure that, after 10% dropout, a 2-sided test with α=0.05 has 80% power to detect a 20% relative reduction (0.30 to 0.24) in the prevalence of hypertension attributed to the new drug.
Dichotomous Outcome – Two Independent Samples (H0 Test) (Practice)
Sample Size for Hypothesis Test, H0: p1 = p2: $n_i = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{ES}\right)^2$, where $ES = \frac{|p_1 - p_2|}{\sqrt{p(1-p)}}$
Example: Compare prevalence of hyperglycemia in a trial of a new drug versus placebo.
Parameters/Assumptions:
Effect to detect: 40% relative reduction
Assumed prevalence (Placebo): p1 = 0.50
Assumed prevalence (Drug): p2 = 0.30
Assumed dropout rate: 15%
2-sided type I error rate (α): 0.05
Desired power (1-β): 0.80
p = 0.40
ES = _____
n_i = ____
n1 = ____ n2 = ____ n = _____
Take into account the dropout rate: N (number to enroll) = n / (% retained)
N = ________________________
Dichotomous Outcome – Two Independent Samples (H0 Test) (Practice, worked answer)
Sample Size for Hypothesis Test, H0: p1 = p2: $n_i = 2\left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{ES}\right)^2$, where $ES = \frac{|p_1 - p_2|}{\sqrt{p(1-p)}}$
Example: Compare prevalence of hyperglycemia in a trial of a new drug versus placebo.
Parameters/Assumptions:
Effect to detect: 40% relative reduction
Assumed prevalence (Placebo): p1 = 0.50
Assumed prevalence (Drug): p2 = 0.30
Assumed dropout rate: 15%
2-sided type I error rate (α): 0.05 (Z = 1.96)
Desired power (1-β): 0.80 (Z = 0.84)
p = (0.50 + 0.30) / 2 = 0.40
$ES = \frac{|0.50 - 0.30|}{\sqrt{0.40(1-0.40)}} = 0.4082$
$n_i = 2\left(\frac{1.96 + 0.84}{0.4082}\right)^2 = 94.1$
n1 = 94.1, n2 = 94.1, n = 188.2
Take into account the dropout rate: N (number to enroll) = n / (% retained)
N = 188.2 / 0.85 = 221.4, rounded up to 222 subjects
Enrolling N = 222 subjects will ensure that, after 15% dropout, a 2-sided test with α=0.05 has 80% power to detect a 40% relative reduction (0.50 to 0.30) in the prevalence of hyperglycemia attributed to the new drug.
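The hypothesis-test calculation above can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, the table values Z = 1.96 (2-sided α = 0.05) and Z = 0.84 (80% power) are hard-coded as defaults, and the rounding convention (total rounded to a whole subject, enrollment rounded up) is an assumption.

```python
import math

def effect_size(p1, p2):
    """ES = |p1 - p2| / sqrt(p(1 - p)), with p the mean of p1 and p2."""
    p = (p1 + p2) / 2
    return abs(p1 - p2) / math.sqrt(p * (1 - p))

def n_per_group_test(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for a 2-sided test of H0: p1 = p2."""
    return 2 * ((z_alpha + z_beta) / effect_size(p1, p2)) ** 2

def n_to_enroll(n_total, dropout):
    """N = n / (% retained); n rounded to a whole subject, N rounded up (assumed convention)."""
    return math.ceil(round(n_total) / (1 - dropout))

# Hypertension example: p1 = 0.30, p2 = 0.24, 10% dropout
n_htn = n_per_group_test(0.30, 0.24)
print(round(n_htn, 1))                 # 858.5 per group
print(n_to_enroll(2 * n_htn, 0.10))    # 1908 subjects to enroll

# Hyperglycemia practice: p1 = 0.50, p2 = 0.30, 15% dropout
n_glu = n_per_group_test(0.50, 0.30)
print(round(n_glu, 1))                 # 94.1 per group
print(n_to_enroll(2 * n_glu, 0.15))    # 222 subjects to enroll
```

The two examples show why small effects are expensive: a difference of 0.06 needs roughly nine times as many subjects as a difference of 0.20.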
SECTION 5.7 Introduction to correlation
Learning Outcome: Describe the conceptual basis and properties of the correlation coefficient.
Correlation and Regression are both measures of association
“Association”: Statistical dependence between two variables:
Exposure (e.g. risk factor, protective factor, predictor variable, treatment)
Outcome (e.g. disease, event)
Correlation and Regression are both measures of association
“Association” Example: The degree to which the rate of disease in persons with a specific exposure is either higher or lower than the rate of disease among those without that exposure.
Correlation and Regression are both measures of association
Some terms for “association” variables:
Variable 1: “x” variable, independent variable, predictor variable, exposure variable
Variable 2: “y” variable, dependent variable, outcome variable
Correlation Coefficient
Different types depending on numerical properties of “x” and “y” variables:
Pearson: 2 continuous variables (~ normally distributed)
Spearman: 2 continuous variables (≥1 variable not normally distributed)
Point bi-serial: one continuous and one binary variable
Phi-coefficient: two dichotomous variables
Correlation Coefficient
Properties of correlation coefficients:
Range of -1.0 to 1.0
Value of -1.0 (perfect negative correlation)
Value of 1.0 (perfect positive correlation)
Value of 0 (no correlation (“association”))
As a rule of thumb, correlation coefficients (by absolute value):
0.0 to 0.30: “weak”
0.30 to 0.70: “moderate”
0.70 to 1.0: “high”
Usually, the p-value generated for r is based on the null hypothesis H0 that r = 0.
Other points to note:
The correlation coefficient is unaffected by units of measurement
Correlation does not imply causation
Correlation should not be used when:
a) There is a non-linear relationship between variables
b) There are outliers
c) There are distinct sub-group effects
In these situations, correlation coefficients can be spurious
SECTION 5.8 Calculate and interpret correlation coefficients
Learning Outcome: Calculate and interpret correlation coefficients: Pearson and Spearman (interpretation only)
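Correlation Coefficient
Computation Form: Pearson correlation (“r”)
$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x s_y}$
where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, and $s_x$ and $s_y$ are the sample standard deviations of X and Y. The numerator is the co-variation of X and Y.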
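The computation form above translates directly into code. The sketch below is illustrative: `pearson_r` is a hypothetical helper name and the five (x, y) pairs are made-up data, not values from the course.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r = co-variation / ((n - 1) * s_x * s_y), using sample standard deviations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    co_variation = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return co_variation / ((n - 1) * stdev(x) * stdev(y))

# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))  # 0.775
```

Here the co-variation is 6, s_x is about 1.58 and s_y about 1.22, so r = 6 / (4 × 1.58 × 1.22), a "high" correlation by the rule of thumb.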
The t-test for the correlation coefficient
A t-test can be used to test whether the correlation between two variables is significant. The test statistic is
$t = r\sqrt{\frac{n-2}{1-r^2}}$
Guidelines: Using the t-test for the correlation coefficient
1. State H0 and H1.
2. Specify α.
3. Determine the degrees of freedom. d.f. = n – 2
4. Find the critical value(s) from table 2 with n – 2 degrees of freedom.
5. Compute the test statistic.
Example: Assume a correlation coefficient of 0.28 is observed with a sample size of n = 26. We wish to test this relationship in a 2-sided manner with α = 0.05.
1. State H0 and H1. H0: r = 0; H1: r ≠ 0
2. Specify α. α = 0.05 (2-sided)
3. Determine the degrees of freedom. d.f. = n – 2 = 26 – 2 = 24
4. Find the critical value(s) from table 2 with d.f. = n – 2 = 24: t = ±2.064
5. Compute the test statistic. $t = 0.28\sqrt{\frac{26-2}{1-0.28^2}} = 1.43$
Conclusion: 1.43 < 2.064. Do not reject H0
Practice: Assume a correlation coefficient of 0.43 is observed with a sample size of n = 22. We wish to test this relationship in a 2-sided manner with α = 0.05.
1. State H0 and H1. H0: _____; H1: _____
2. Specify α. α = ___________
3. Determine the degrees of freedom. d.f. = n – 2 = ______
4. Find the critical value(s) from table 2 with d.f. = n – 2 = _____
5. Compute the test statistic. t = _____
Conclusion: Reject or do not reject H0
Practice (worked answer): Assume a correlation coefficient of 0.43 is observed with a sample size of n = 22. We wish to test this relationship in a 2-sided manner with α = 0.05.
1. State H0 and H1. H0: r = 0; H1: r ≠ 0
2. Specify α. α = 0.05 (2-sided)
3. Determine the degrees of freedom. d.f. = n – 2 = 22 – 2 = 20
4. Find the critical value(s) from table 2 with d.f. = n – 2 = 20: t = ±2.086
5. Compute the test statistic. $t = 0.43\sqrt{\frac{22-2}{1-0.43^2}} = 2.13$
Conclusion: 2.13 > 2.086. Reject H0
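The five steps above can be sketched in Python. This is a minimal illustration with hypothetical function names; the critical values 2.064 (d.f. = 24) and 2.086 (d.f. = 20) are standard 2-sided α = 0.05 entries from a t table and are passed in by hand rather than computed.

```python
import math

def t_statistic(r, n):
    """Test statistic for H0: r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def reject_h0(r, n, t_critical):
    """2-sided decision: reject H0 when |t| exceeds the tabled critical value."""
    return abs(t_statistic(r, n)) > t_critical

# Example: r = 0.28, n = 26, critical value 2.064 (d.f. = 24)
print(round(t_statistic(0.28, 26), 2), reject_h0(0.28, 26, 2.064))  # 1.43 False

# Practice: r = 0.43, n = 22, critical value 2.086 (d.f. = 20)
print(round(t_statistic(0.43, 22), 2), reject_h0(0.43, 22, 2.086))  # 2.13 True
```

Both correlations fall in the "weak" to "moderate" range, yet only the second is statistically significant, which shows that significance depends on n as well as on the size of r.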
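Columns: Subject ID, “x”, “y”, $x_i-\bar{x}$, $y_i-\bar{y}$, $(x_i-\bar{x})(y_i-\bar{y})$
Summary rows: Sum of all observations, Mean value, Standard deviation
So, $r_{xy} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(8-1)\times(6.24\times 10.18)} = 0.84$
See SAS page 1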
Practice Calculation
Columns: Subject ID, “x”, “y”, $x_i-\bar{x}$, $y_i-\bar{y}$, $(x_i-\bar{x})(y_i-\bar{y})$
1816 ??? ??? 36 ??? 4116 ??? ??? ??? 7816 ??? 8612 ???
Sum of all observations: ??? Mean value: ?? Standard deviation:
So, r xy = _________________________________
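Columns: Subject ID, “x”, “y”, $x_i-\bar{x}$, $y_i-\bar{y}$, $(x_i-\bar{x})(y_i-\bar{y})$
Summary rows: Sum of all observations, Mean value, Standard deviation
So, $r_{xy} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(8-1)\times(3.72\times 7.25)} = 0.47$
See SAS page 2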
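Correlation Coefficient
Computation Form: Pearson correlation (“r”)
$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x s_y}$
From the formula above, it should be intuitive that the Pearson R is sensitive to extreme values, because a single observation far from both means contributes a very large product to the numerator.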
Data table (columns: ID, X, Y); R = 0.161. See SAS pages 3-4
Data table (columns: ID, X, Y); R = 0.573. See SAS pages 5-6
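The sensitivity of the Pearson R to extreme values can be demonstrated with a small sketch. The data below are hypothetical (they are not the values from the SAS output), and `pearson_r` is an illustrative helper name: six nearly uncorrelated pairs, then the same pairs plus one extreme observation at (30, 30).

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson r = co-variation / ((n - 1) * s_x * s_y)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    co_variation = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return co_variation / ((n - 1) * stdev(x) * stdev(y))

# Hypothetical data: essentially no linear relationship
x = [1, 2, 3, 4, 5, 6]
y = [4, 2, 5, 1, 6, 3]
r_without = pearson_r(x, y)
print(round(r_without, 3))  # 0.086

# Same data plus a single extreme observation
r_with = pearson_r(x + [30], y + [30])
print(round(r_with, 3))     # 0.974
```

One added point moves the coefficient from "weak" to "high", even though the other six observations are unchanged.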
Correlation Coefficient
With extreme values, you can use the Spearman “rank” correlation procedure to remove the undue influence of the extreme values.
Assuming no ties in ranks:
$R_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$
where $d_i = x_i - y_i$ is the difference between the ranks of each observation.
Example: Incorrect use of Pearson R
Data table (columns: ID, X, Y); R = 0.696. See SAS page 7
Columns: ID, X, Y, Rank X, Rank Y, $d_i$, $d_i^2$
Pearson R = 0.696; Sum of $d_i^2$ = 178
So, $R_s = 1 - \frac{6 \times 178}{10(100-1)} = 1 - \frac{1068}{990} = -0.08$
See SAS page 8
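The Spearman calculation above can be sketched as follows. The function names are hypothetical, the ranking helper assumes no ties (as the formula does), and the seven (x, y) pairs are made-up data with one extreme observation; only the sum of squared rank differences (178) and n = 10 come from the slide.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_from_sum_d2(sum_d2, n):
    """R_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def spearman_r(x, y):
    """Spearman rank correlation from raw data without ties."""
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return spearman_from_sum_d2(sum_d2, len(x))

# Slide example: sum of d_i^2 = 178 with n = 10
print(round(spearman_from_sum_d2(178, 10), 2))  # -0.08

# Hypothetical data with one extreme pair (30, 30)
x = [1, 2, 3, 4, 5, 6, 30]
y = [4, 2, 5, 1, 6, 3, 30]
print(round(spearman_r(x, y), 3))  # 0.429
```

Because the extreme pair only counts as rank 7 of 7, it cannot dominate the coefficient the way its raw values dominate the Pearson numerator; for these hypothetical data the Pearson R is about 0.97.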
SECTION 5.9 Use of correlation in Excel, PowerPoint, and SPSS
Learning Outcomes:
Calculate correlation coefficients in Excel and SPSS
Produce a scatter plot in PowerPoint to depict correlation
Calculate Correlation Coefficients in Excel; Plot in PowerPoint
Excel (refer to Excel spreadsheet):
=CORREL(Array 1,Array 2)
=CORREL(A4:A15,B4:B15)
Data columns: X, Y
PowerPoint:
---Insert Chart
---X-Y Scatter
---Add Trend Line (click on data points)
r = 0.76
SPSS: Analyze > Correlate > Bivariate; coefficients: Pearson, Spearman; variables: Age, Body Mass Index
SPSS: Analyze > Correlate > Bivariate; coefficients: Pearson, Spearman; variables: Glucose, Triglycerides