Contingency tables Brian Healy, PhD

Types of analysis - independent samples

Outcome        Explanatory   Analysis
-------------  ------------  ------------------------------------
Continuous     Dichotomous   t-test, Wilcoxon test
Continuous     Categorical   ANOVA, linear regression
Continuous     Continuous    Correlation, linear regression
Dichotomous    Dichotomous   Chi-square test, logistic regression
Dichotomous    Continuous    Logistic regression
Time to event  Dichotomous   Log-rank test

Example

MS is known to have a genetic component. Several single nucleotide polymorphisms (SNPs) have been associated with susceptibility to MS.

Question: Do patients with susceptibility SNPs experience more sustained progression than patients without susceptibility SNPs?

Data

Initially, we will focus on presence vs. absence of SNPs. Among our 190 GA-treated patients, 74 had the SNP and 116 did not.
- 12 patients with the SNP experienced sustained progression
- 13 patients without the SNP experienced sustained progression

Another way to look at the data

Rather than investigating two proportions, we can look at a 2x2 table of the same data:

          SNP+  SNP-  Total
Prog        12    13     25
No prog     62   103    165
Total       74   116    190

Question

In our analysis, we assume that the margins are set. If there were no relationship between the two variables, what would we expect the values in the table to be?

Example

As an example, use this table, where only the margins are filled in:

          SNP+  SNP-  Total
Prog         ?     ?    100
No prog      ?     ?    100
Total       50   150    200

Under no relationship, each expected cell is (row total x column total)/N:
50*100/200 = 25 and 150*100/200 = 75, giving 25 and 75 in each row.

Expected table

Expected table for our analysis:

          SNP+   SNP-   Total
Prog       9.7   15.3      25
No prog   64.3  100.7     165
Total       74    116     190

(25*74/190 = 9.7, 25*116/190 = 15.3, 165*74/190 = 64.3, 165*116/190 = 100.7)

How different is our observed data compared to the expected table?
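The expected-count rule can be checked with a few lines of Python; this is just a sketch of the (row total x column total)/N calculation using the margins above:

```python
# Expected counts under independence for the lecture's 2x2 table.
# Margins: 25 progressed, 165 did not; 74 SNP+, 116 SNP-; N = 190.
row_tot = [25, 165]
col_tot = [74, 116]
N = sum(row_tot)  # 190

# Expected cell = (row total * column total) / N
expected = [[r * c / N for c in col_tot] for r in row_tot]
for row in expected:
    print([round(x, 1) for x in row])
# [9.7, 15.3]
# [64.3, 100.7]
```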

Does our data show an effect?

To test for an association between the outcome and the predictor, we would like to know if our observed table was different from the expected table. How could we investigate if our table was different?

Chi-square distribution

The test statistic is the sum over the four cells of (observed - expected)^2 / expected. This statistic follows a chi-square distribution with 1 degree of freedom. Assume x is a normal random variable with mean = 0 and variance = 1:
- x^2 has a chi-square distribution with 1 degree of freedom

[Figure: chi-square density with 1 degree of freedom; the area to the right of 3.84 is 0.05]

Critical information for chi-square

For 1 degree of freedom, the cut-off for alpha = 0.05 is 3.84.
- For the normal distribution, the cut-off is 1.96
- Note that 1.96^2 = 3.84
The test is inherently two-sided since the statistic is squared.

Hypothesis test with chi-square

1) H0: No association between SNP and progression
2) Dichotomous outcome, dichotomous predictor
3) Chi-square test
4) Test statistic: chi-square = 0.99
5) p-value = 0.32
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression
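The steps above can be reproduced with the standard library alone; a minimal sketch of the Pearson chi-square computation (for 1 degree of freedom, the upper-tail probability is erfc(sqrt(x/2))):

```python
import math

# Observed table: rows = progression yes/no, columns = SNP+/SNP-
obs = [[12, 13],
       [62, 103]]

row_tot = [sum(r) for r in obs]           # [25, 165]
col_tot = [sum(c) for c in zip(*obs)]     # [74, 116]
N = sum(row_tot)                          # 190

# Expected counts under independence
exp = [[row_tot[i] * col_tot[j] / N for j in range(2)] for i in range(2)]

# Pearson chi-square: sum of (O - E)^2 / E over the four cells
chi2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
           for i in range(2) for j in range(2))

# Upper-tail probability for a chi-square with 1 df
p = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 2), round(p, 2))   # 0.99 0.32
```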

[STATA output: chi-square statistic and p-value]

Hypothesis test comparison

Yesterday, we completed this same test using a comparison of proportions. Let's compare the results:

Method               Test statistic      p-value
Test of proportions  z = 0.996           p = 0.32
Chi-square test      chi-square = 0.992  p = 0.32

We get the same result!
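The match is exact in theory: for a 2x2 table the chi-square statistic is the square of the two-sample z statistic, just as the critical values satisfy 1.96^2 = 3.84. A quick numeric check:

```python
z = 0.996      # z statistic from the test of proportions
chi2 = 0.992   # chi-square statistic from the chi-square test

print(round(z ** 2, 3))      # equals the chi-square statistic
print(round(1.96 ** 2, 2))   # 3.84, the chi-square critical value
```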

Question: Continuity correction

What is a continuity correction and when should I use it?
- The continuity correction subtracts 1/2 from the numerator of the chi-square statistic
- It is designed to improve the performance of the normal approximation
- Use the default in STATA (or your stat package), but know which version you are using
- It is less important today since exact tests are easily used

Question: Why 1 degree of freedom?

We used a chi-square distribution with 1 degree of freedom, but there are 4 numbers. Why?
- For our analysis, we assume that the margins are fixed
- If we pick one number in the table, the rest of the numbers are determined

          SNP+  SNP-  Total
Prog         .     .     25
No prog      .     .    165
Total       74   116    190

Question: Normal approximation

We are using a normal approximation, but yesterday we talked about this being less than perfect. When can we use this test?
- Rule of thumb: all expected cell counts larger than 5
- Large samples
What should I do if I do not have a large sample?
- Fisher's exact test

Fisher's exact test

Remember that a p-value is the probability of the observed value or something more extreme. Fisher's exact test looks at a table and determines how many tables are as extreme as or more extreme than the observed table under the null hypothesis of no association. This is the same concept as the exact test from the Wilcoxon test. It is easy to compute in STATA.

Hypothesis test with exact test

1) H0: No association between SNP and progression
2) Dichotomous outcome, dichotomous predictor
3) Exact test
4) Test statistic: NA
5) p-value = 0.38
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression
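A sketch of the enumeration behind Fisher's exact test (STATA computes this internally; this pure-Python version is for illustration only). With all margins fixed, the number of progressors in the SNP+ group follows a hypergeometric distribution, and the two-sided p-value sums the probabilities of every table no more likely than the observed one:

```python
from math import comb

# Cell counts: a = Prog/SNP+, b = Prog/SNP-, c = NoProg/SNP+, d = NoProg/SNP-
a, b, c, d = 12, 13, 62, 103
r1 = a + b            # 25 progressors
c1 = a + c            # 74 SNP+
N = a + b + c + d     # 190

def table_prob(k):
    # P(k of the r1 progressors are SNP+ | fixed margins): hypergeometric
    return comb(c1, k) * comb(N - c1, r1 - k) / comb(N, r1)

p_obs = table_prob(a)
# Two-sided p: sum the probabilities of all tables no more likely
# than the observed one
p_two_sided = sum(table_prob(k) for k in range(r1 + 1)
                  if table_prob(k) <= p_obs + 1e-12)
print(round(p_two_sided, 2))   # the slide reports p = 0.38
```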

[STATA output: two-sided p-value from Fisher's exact test]

Results

Our results were very similar to those from the other tests, in part because we have a large sample size.
- Normal approximation OK
In small samples, larger differences between the tests are possible.

Types of studies

In a cohort study, people are enrolled based on exposure status, so we can somewhat control how many exposed and unexposed people we have. In a case-control study, people are enrolled based on disease status, so we ensure that we have both diseased and non-diseased people.

Measures of association

Risk difference: RD = p1 - p2
- Do these added together equal 1? Why?
- Under the null, what is the risk difference?
Relative risk (risk ratio): RR = p1/p2
- Under the null, what is the relative risk?

P(Disease+|Exposure+) = a/m1 = p1
- What is another name for this quantity?
- Prevalence in patients with exposure
P(Disease+|Exposure-) = b/m2 = p2
RD = a/m1 - b/m2
- Difference between proportions

            Exposure
Disease      Y     N    Total
Y            a     b    n1
N            c     d    n2
Total        m1    m2   N

Confidence interval for RD

Several confidence intervals are available for the RD.
- Asymptotic normal distribution: SE(RD) = sqrt(p1(1-p1)/m1 + p2(1-p2)/m2)
- Confidence interval: RD +/- 1.96 x SE(RD)
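A minimal sketch of this asymptotic (Wald) interval in Python, plugging in the study's counts (12/74 SNP+ and 13/116 SNP- progressors):

```python
import math

a, m1 = 12, 74     # progressors / total among SNP+
b, m2 = 13, 116    # progressors / total among SNP-

p1, p2 = a / m1, b / m2
rd = p1 - p2

# Asymptotic standard error of the risk difference
se = math.sqrt(p1 * (1 - p1) / m1 + p2 * (1 - p2) / m2)
lo, hi = rd - 1.96 * se, rd + 1.96 * se
print(round(rd, 2), (round(lo, 3), round(hi, 3)))   # 0.05 (-0.052, 0.152)
```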

Estimate of RR: RR = (a/m1) / (b/m2) = p1/p2

            Exposure
Disease      Y     N    Total
Y            a     b    n1
N            c     d    n2
Total        m1    m2   N

Confidence interval for RR

To construct a confidence interval, we use a normal approximation. In addition, the CI is based on a log transformation of the RR.
- log(RR) = ln(RR)
- I will use ln and log interchangeably to represent the natural logarithm
Quick math: e^ln(RR) = RR

ln(RR)

Why do we use the ln(RR)?
- It is generally easier to deal with subtraction than with division
- ln(RR) = ln(p1/p2) = ln(p1) - ln(p2)
We can estimate the standard error for the ln(RR) using the following formula:
SE(ln(RR)) = sqrt(1/a - 1/m1 + 1/b - 1/m2)

Confidence interval

Now that we have an estimate of the variance, we can create a confidence interval for ln(RR) using our standard normal approximation: ln(RR) +/- 1.96 x SE(ln(RR)). To create a confidence interval for RR, we exponentiate the endpoints of this interval.
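Putting the pieces together for the study data — estimate the RR, build the interval on the ln scale, then exponentiate back:

```python
import math

a, m1 = 12, 74     # progressors / total among SNP+
b, m2 = 13, 116    # progressors / total among SNP-

rr = (a / m1) / (b / m2)

# Standard error of ln(RR)
se = math.sqrt(1 / a - 1 / m1 + 1 / b - 1 / m2)

# 95% CI on the ln scale, then exponentiate the endpoints
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(round(rr, 2), (round(lo, 2), round(hi, 2)))   # 1.45 (0.7, 3.0)
```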

[STATA output: estimated proportions in the two groups and the p-value from the chi-square test]

Given the confidence interval, would you reject the null hypothesis? Why?

Interpretation of RD

The estimated risk difference is 0.05.
- The interpretation is that the risk of progression for patients with the susceptibility allele is 5 percentage points higher than for patients without the allele
The 95% confidence interval for the risk difference is (-0.052, 0.152).
- Is there a significant difference between the allele groups?
What was the confidence interval for the difference between the proportions that we investigated two classes ago?
- 95% CI: (-0.052, 0.152), the same interval

Interpretation of RR

The estimated relative risk is 1.45.
- The interpretation is that the risk of progression for patients with the susceptibility allele is 1.45 times the risk for patients without the allele
The 95% confidence interval for the relative risk is (0.70, 3.00).
- Is there a significant difference between the allele groups?

RD and RR

Now that we know how to estimate these measures, can we estimate them with any study design?
- Not directly
- In a cohort study, the probabilities of interest, P(Disease|Exposure), can be estimated
- In a case-control study, these probabilities cannot be estimated directly, so more information is required

Bayes theorem - technical

The relationship between P(Disease|Exposure) and P(Exposure|Disease) can be shown using Bayes theorem:

P(D+|E+) = P(E+|D+) P(D+) / [P(E+|D+) P(D+) + P(E+|D-) P(D-)]

Therefore, if we knew P(D+), we could estimate P(D+|E+) from a case-control study.
- P(D+) is the prevalence
- Usually we do not know this, so we cannot directly estimate the relative risk or risk difference
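A numeric check of Bayes theorem using the lecture's own table: plugging in P(E+|D+) = 12/25 and P(E+|D-) = 62/165 (the quantities a case-control study estimates) together with the prevalence P(D+) = 25/190 recovers the cohort risk 12/74:

```python
# Bayes theorem:
# P(D+|E+) = P(E+|D+) P(D+) / [P(E+|D+) P(D+) + P(E+|D-) P(D-)]
p_e_given_d = 12 / 25        # SNP+ among progressors
p_e_given_not_d = 62 / 165   # SNP+ among non-progressors
prev = 25 / 190              # prevalence of progression, P(D+)

p_d_given_e = (p_e_given_d * prev) / (
    p_e_given_d * prev + p_e_given_not_d * (1 - prev))
print(round(p_d_given_e, 4), round(12 / 74, 4))   # the two agree
```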

Odds ratio

Odds: p/(1-p)
Odds ratio: OR = [p1/(1-p1)] / [p2/(1-p2)]
- Under the null, what is the OR?

            Exposure
Disease      Y     N    Total
Y            a     b    n1
N            c     d    n2
Total        m1    m2   N

This is the estimate of the odds ratio from a cohort study: OR = (a/c) / (b/d) = ad/bc

            Exposure
Disease      Y     N    Total
Y            a     b    n1
N            c     d    n2
Total        m1    m2   N

This is the estimate of the odds ratio from a case-control study: OR = (a/b) / (c/d) = ad/bc

Amazing!!

The estimated odds ratio from each kind of study ends up being the same thing! Therefore, we can complete a case-control study and still get an estimate that we really care about: the effect of the exposure on the disease. This relationship is why the odds ratio is so commonly used.

Confidence interval for OR

To calculate a confidence interval for the OR, we will investigate Woolf's approximation.
- Other approximations and exact intervals are available in STATA (exact is the default)
Woolf's approximation focuses on a log transformation of the OR, as for the RR.
- log(OR) = ln(OR)
Quick math: e^ln(OR) = OR

Woolf's approximation gives us SE(ln(OR)) = sqrt(1/a + 1/b + 1/c + 1/d). Using our normal approximation, we can create a confidence interval for ln(OR) using ln(OR) +/- 1.96 x SE(ln(OR)). The confidence interval for the OR is obtained by exponentiating the endpoints.
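A minimal Python sketch of Woolf's interval for the study's table:

```python
import math

# Cell counts: a = Prog/SNP+, b = Prog/SNP-, c = NoProg/SNP+, d = NoProg/SNP-
a, b, c, d = 12, 13, 62, 103

or_hat = (a * d) / (b * c)

# Woolf: standard error of ln(OR)
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# 95% CI on the ln scale, then exponentiate
lo = math.exp(math.log(or_hat) - 1.96 * se)
hi = math.exp(math.log(or_hat) + 1.96 * se)
print(round(or_hat, 2), (round(lo, 2), round(hi, 2)))   # 1.53 (0.66, 3.57)
```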

Example

In yesterday's class, we discussed a study in which we wanted to estimate the effect of a SNP on disease progression.
- What type of study was this?
- A cohort study, because we followed people forward over time
Let's estimate the odds ratio and confidence interval for this study.

CI for OR

          SNP+  SNP-  Total
Prog        12    13     25
No prog     62   103    165
Total       74   116    190

Based on this table, the estimated OR = (12*103)/(13*62) = 1.53.
95% CI: (0.66, 3.57)
Should we reject the null hypothesis of OR = 1?

Interpretation of OR

The estimated odds ratio is 1.53.
- The interpretation is that the ODDS of progression for patients with the susceptibility allele are 1.53 times the ODDS for patients without the allele
The 95% confidence interval for the odds ratio is (0.66, 3.57).
- Is there a significant association between SNP and disease?

[STATA output: estimated OR and estimated CI (Woolf)]

OR vs. RR

Although the odds ratio is interesting, the relative risk is more intuitive. If we have a rare disease, which is often the case for a case-control study, then p1 and p2 are small, so OR = [p1/(1-p1)]/[p2/(1-p2)] is approximately p1/p2 = RR. Therefore, in these cases, the odds ratio is also an estimate of the relative risk. In certain other designs, the odds ratio also provides a valid estimate of the relative risk (see other courses).
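A small numeric illustration of the rare-disease approximation, using hypothetical risks of 2% and 1% (not the study data):

```python
# With small risks p1 and p2, 1 - p1 and 1 - p2 are close to 1,
# so OR = [p1/(1-p1)] / [p2/(1-p2)] is approximately p1/p2 = RR.
p1, p2 = 0.02, 0.01   # hypothetical risks in exposed and unexposed

rr = p1 / p2
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
print(round(rr, 2), round(odds_ratio, 2))   # 2.0 2.02 -- nearly equal
```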

Hypothesis test with CI

1) H0: No association between SNP and progression (RD = 0)
2) Dichotomous outcome, dichotomous predictor
3) Risk difference 95% confidence interval
4) Test statistic: estimated RD = 0.05; 95% CI: (-0.052, 0.152)
5) p-value > 0.05
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression

Hypothesis test with CI

1) H0: No association between SNP and progression (RR = 1)
2) Dichotomous outcome, dichotomous predictor
3) Relative risk 95% confidence interval
4) Test statistic: estimated RR = 1.45; 95% CI: (0.70, 3.00)
5) p-value > 0.05
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression

Hypothesis test with CI

1) H0: No association between SNP and progression (OR = 1)
2) Dichotomous outcome, dichotomous predictor
3) Odds ratio 95% confidence interval
4) Test statistic: estimated OR = 1.53; 95% CI: (0.66, 3.57)
5) p-value > 0.05
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression