More Powerful Genome-wide Association Methods for Case-control Data Robert C. Elston, PhD Case Western Reserve University Cleveland Ohio.



SINGLE-MARKER AND TWO-MARKER ASSOCIATION TESTS FOR UNPHASED CASE-CONTROL GENOTYPE DATA, WITH A POWER COMPARISON
Kim S, Morris NJ, Won S, Elston RC. Genetic Epidemiology, in press.

Introduction
- A genome-wide association study with case-control data aims to localize disease susceptibility regions in the genome.
- Single nucleotide polymorphism (SNP) markers, which are usually diallelic, have been used to cover the whole genome.
- Two categories of tests have been applied to these data:
  - single-marker association tests, which examine association between affection status and the SNP data one SNP at a time
  - multi-marker association tests, which examine association between affection status and multiple SNPs simultaneously

Association Analysis: information for association
[Figure: diagram relating allele, HWD and LD information, with regions labeled a to g]
a. Allele frequency trend test
b. HWD trend test
c. LD contrast test
d. Genotype frequency test
e. Haplotype-based test with HWE
f. ???
g. Phase-known genotype-based test

- The allele frequency, HWD and LD contrast tests are typically developed in what has been termed a retrospective context; i.e., case-control status is considered fixed and the genotypes are considered random.
- For case-control data, epidemiologists typically take advantage of the properties of the odds ratio and use the prospective logistic regression model, making the case-control status the random variable dependent on the predictors.
- Prospective modeling tends to allow for greater flexibility, especially when adjusting for covariates.
- It also provides a natural way to adjust for any correlations between the tests or other covariates, and can be extended to quantitative traits.

Notation and Assumptions
- We suppose there are two diallelic SNP markers, A and B, having alleles $\{A_1, A_2\}$ and $\{B_1, B_2\}$, respectively, where $A_1$ and $B_1$ are the minor alleles.
- The genotypes are coded as
  $X = 1$ for $A_1A_1$, $0$ for $A_1A_2$, $-1$ for $A_2A_2$;
  $Y = 1$ for $B_1B_1$, $0$ for $B_1B_2$, $-1$ for $B_2B_2$.
- $I_{case}$ and $I_{ctrl}$ denote the sets of cases and controls.
- We make minimal assumptions about the general population sampled; in particular, we do not assume HWE in the population.
- $\mu_X$, $\sigma_X^2$ and $\sigma_{XY}$ denote the expected value of X, the variance of X and the covariance of X and Y, respectively.

- The HWD parameter for marker A is given by $d_A = P(A_1A_1) - p_A^2$.
- The HWD parameter can be expressed as $d_A = \frac{1}{2}\left[\sigma_X^2 - 2p_A(1 - p_A)\right]$.
- This means that the HWD parameter, $d_A$, is half the deviation of the variance from the variance expected under HWE.
- The composite LD parameter for alleles $A_1$ and $B_1$ of markers A and B is $\Delta = \sigma_{XY}/2$.
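The relations above can be sketched numerically. This is a minimal illustration, not the authors' code: it assumes the genotype coding 1, 0, -1 (minor homozygote, heterozygote, major homozygote), uses 1/n moments so the in-sample identity with $P(A_1A_1) - p_A^2$ holds exactly, and takes the composite LD as half the covariance of the coded genotypes. The function name `allele_hwd_ld` and the toy data are hypothetical.

```python
def allele_hwd_ld(x, y):
    """Estimate p_A, d_A and the composite LD from coded genotypes.

    x, y: equal-length lists of genotype codes for markers A and B,
    assumed to be 1, 0, -1 (minor homozygote, heterozygote, major homozygote).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Moments with a 1/n divisor, so the identities hold exactly in-sample.
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    p_a = (mean_x + 1) / 2                    # mu_X = 2 p_A - 1 under this coding
    d_a = (var_x - 2 * p_a * (1 - p_a)) / 2   # half the excess variance over HWE
    delta = cov_xy / 2                        # assumed relation for composite LD
    return p_a, d_a, delta


# Toy data (hypothetical): 1 minor homozygote, 3 heterozygotes, 6 major homozygotes at A.
x = [1, 0, 0, 0, -1, -1, -1, -1, -1, -1]
y = [1, 0, 0, -1, 0, -1, -1, -1, -1, -1]
p_a, d_a, delta = allele_hwd_ld(x, y)
```

In this toy sample the genotype frequency of $A_1A_1$ is 0.1 and $p_A$ is 0.25, so `d_a` agrees with $0.1 - 0.25^2$ up to rounding.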

Probabilities for unphased genotypes

- The joint test of allele frequency and HWD contrasts between cases and controls tests the null hypothesis $H_0: (p_{A|case}, d_{A|case}) = (p_{A|ctrl}, d_{A|ctrl})$.
- Let $Z_i = (X_i, X_i^2)'$; the sample mean $\bar{Z}$ is a sufficient statistic for $(p_A, d_A)'$.
- The Allelic-HWD contrast test can be performed by comparing $\bar{Z}_{case}$ and $\bar{Z}_{ctrl}$ with a $T^2$ statistic.
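A two-sample $T^2$ contrast of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the authors' exact statistic: it assumes the coding 1, 0, -1, takes $Z_i = (X_i, X_i^2)'$, uses unpooled sample covariances for the difference of means, and refers the statistic to a chi-square distribution with 2 degrees of freedom (whose survival function is $e^{-t/2}$). The function names are hypothetical.

```python
import math


def mean_and_cov(z):
    """Mean vector and unbiased covariance (s00, s01, s11) of a list of pairs."""
    n = len(z)
    m0 = sum(a for a, _ in z) / n
    m1 = sum(b for _, b in z) / n
    s00 = sum((a - m0) ** 2 for a, _ in z) / (n - 1)
    s11 = sum((b - m1) ** 2 for _, b in z) / (n - 1)
    s01 = sum((a - m0) * (b - m1) for a, b in z) / (n - 1)
    return (m0, m1), (s00, s01, s11), n


def allelic_hwd_t2(x_case, x_ctrl):
    """T^2-type contrast of Z = (X, X^2) between cases and controls."""
    mc, sc, nc = mean_and_cov([(x, x * x) for x in x_case])
    mk, sk, nk = mean_and_cov([(x, x * x) for x in x_ctrl])
    # Covariance of the difference of the two sample means (unpooled; an assumption).
    v00 = sc[0] / nc + sk[0] / nk
    v01 = sc[1] / nc + sk[1] / nk
    v11 = sc[2] / nc + sk[2] / nk
    det = v00 * v11 - v01 * v01
    d0 = mc[0] - mk[0]
    d1 = mc[1] - mk[1]
    # Quadratic form d' V^{-1} d, using the closed-form 2x2 inverse.
    t2 = (d0 * d0 * v11 - 2 * d0 * d1 * v01 + d1 * d1 * v00) / det
    p_value = math.exp(-t2 / 2)  # chi-square survival function with 2 df
    return t2, p_value


# Toy coded genotypes (hypothetical data).
cases = [1, 1, 0, 0, -1, 1, 0, 1]
controls = [-1, -1, 0, 0, -1, 0, -1, 1]
t2, p_value = allelic_hwd_t2(cases, controls)
```

The sketch fails with a division by zero if all three genotypes are not represented, since the covariance matrix of $(X, X^2)$ is then singular; a real analysis would handle that case explicitly.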

- Let $Z_i = (X_i, Y_i, X_iY_i)'$; $\bar{Z}$ is a sufficient statistic for $(p_A, p_B, \Delta)'$.
- The Allelic-LD contrast test can be performed using a version of Hotelling's $T^2$.
- The additional case-control differences can be captured by the HWD and LD contrast tests, given the allele frequency contrast(s).
- The Allelic-HWD-LD contrast test can be constructed in a similar manner by contrasting the mean vector of $Z_i = (X_i, Y_i, X_i^2, Y_i^2, X_iY_i)'$ between cases and controls.

Single-marker and two-marker association tests with corresponding models and hypotheses

Multistage Tests
- "Self-replication" if the tests are independent
- Sequential tests
  - E.g., the HWD contrast test adjusted for the allele frequency information used in the first stage can be performed by the test of

Penetrance Model and True Marker Association Model
- Let D denote the disease genotype variable coded as $D = 1$ for $D_1D_1$, $0$ for $D_1D_2$, $-1$ for $D_2D_2$.
- We write the penetrance model as:

Constraints for disease models

- Given the true disease model and the LD structure, we can set up the true single-marker association model between the phenotype and the single-marker data X:
- This true association model has the same form as the penetrance model.
- When $(1 - 2p_D)\gamma_{D^2} - \gamma_D \neq 0$, the coefficient of the quadratic term generally approaches 0 faster than does that of the linear term.

Power Computation
- The $T^2$ test in a retrospective model and the score test and LRT in a prospective logistic model are expected to perform similarly.
- The noncentrality parameter of the $T^2$ test for test 2-5 is
- The noncentrality parameters for the other tests can be obtained by using the corresponding sub-matrices of $(\mu_{case} - \mu_{ctrl})$ and $(\Sigma_{case} + \Sigma_{ctrl})$.
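One form consistent with the sub-matrix description above can be written down as a hedged sketch. It assumes n cases and n controls and a large-sample chi-square approximation; these are assumptions, not the slide's stated formulas.

```latex
\lambda = n\,(\mu_{case} - \mu_{ctrl})'\,(\Sigma_{case} + \Sigma_{ctrl})^{-1}\,(\mu_{case} - \mu_{ctrl}),
\qquad
\text{power} = P\!\left(\chi^2_{k}(\lambda) > \chi^2_{k,\,1-\alpha}\right)
```

Here $k$ is the dimension of $Z_i$ for the test in question, $\chi^2_{k}(\lambda)$ is a noncentral chi-square variable with $k$ degrees of freedom, and $\chi^2_{k,\,1-\alpha}$ is the central chi-square quantile at significance level $\alpha$.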

Comparisons of theoretical and empirical power of test 1-2
- For each of the four disease models, parameters were set as follows: $p_D = 0.2$, $p_A = 0.3$, $K = 0.05$, $D_{XD} = 0.048$ ($D' = 0.8$), $n = 2{,}000$ (500 for recessive), $\alpha = 0.05/500{,}000$.
- Empirical power is the proportion of the 100,000 replicates in which the null hypothesis was rejected.
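The significance level and the empirical power definition amount to simple arithmetic, sketched below. The function names and the four p-values are hypothetical; only the values 0.05 and 500,000 come from the slide.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance level for a family-wise level spread over n_tests."""
    return family_alpha / n_tests


def empirical_power(p_values, alpha):
    """Proportion of replicates whose p-value falls below alpha."""
    rejected = sum(1 for p in p_values if p < alpha)
    return rejected / len(p_values)


# alpha = 0.05 / 500,000, i.e. about 1e-7 per test.
alpha = bonferroni_alpha(0.05, 500_000)

# Four made-up replicate p-values, two of which fall below alpha.
power = empirical_power([1e-9, 0.5, 1e-8, 0.2], alpha)
```

With 100,000 replicates, as on the slide, the same ratio would be computed over 100,000 p-values.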


Power comparisons of two-marker tests

Power Comparisons on Real Data
- We estimated LD parameters and marker allele frequencies from the HapMap CEU population.
- The data consist of 120 haplotypes estimated from 30 parent-offspring trios.
- We split chromosome 11 into mutually exclusive consecutive regions containing 3 SNPs each.
- For each region we estimated the LD and allele frequency parameters.
- We excluded regions where the minor allele frequencies of three consecutive markers were less than 0.1, leaving 4,648 regions.
- We chose the disease SNP to be the one with the smallest allele frequency.
- Parameters other than the allele frequency and LD parameters were set to be the same as before.
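The region construction described above can be sketched as follows. This is an illustration only: the function name `build_regions` and the toy frequencies are hypothetical, and the exclusion rule is interpreted as dropping a region when any of its three markers has a minor allele frequency below 0.1, which is one reading of the slide. The two SNPs not chosen as the disease SNP are treated as the observed markers, also an assumption.

```python
def build_regions(maf, size=3, min_maf=0.1):
    """Split SNPs into consecutive non-overlapping regions and pick a disease SNP.

    maf: minor allele frequencies in chromosomal order.
    Returns one dict per retained region.
    """
    regions = []
    for start in range(0, len(maf) - size + 1, size):
        window = maf[start:start + size]
        # Assumption: exclude the region if ANY marker falls below the threshold.
        if min(window) < min_maf:
            continue
        # The disease SNP is the one with the smallest allele frequency in the region.
        offset = min(range(size), key=lambda k: window[k])
        disease_index = start + offset
        marker_indices = [i for i in range(start, start + size) if i != disease_index]
        regions.append({
            "start": start,
            "disease_index": disease_index,
            "marker_indices": marker_indices,
        })
    return regions


# Toy frequencies (hypothetical): the second region is dropped, the trailing SNP is unused.
maf = [0.2, 0.3, 0.15, 0.05, 0.4, 0.3, 0.25, 0.12, 0.4, 0.3]
regions = build_regions(maf)
```

A stricter or looser exclusion rule (for example, dropping a region only when all three frequencies are below 0.1) would change only the `if` condition.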

Mean of power over chromosome 11 of CEU HapMap data

Conclusions
- The best two-marker test always appears to be more powerful than either the best single-marker test or the haplotype-based test.
- It should be possible, by examining the LD structure of the markers, to predict which will be the best two-marker test to perform.
- We need to study tests that use more than two markers.