Parametric versus Non-parametric Genetic Association Analysis Kristel Van Steen, PhD, ScD Université de Liege - Institut Montefiore.

Parametric versus Non-parametric Genetic Association Analysis Kristel Van Steen, PhD, ScD (kristel.vansteen@ugent.be) Université de Liege - Institut Montefiore Ghent University – StepGen cvba December 18th, 2007

Genetic Association Studies Aim: detect association between one or more genetic polymorphisms and a trait, which may be measured, dichotomous, or time to onset. (Genuine) genetic associations arise only because human populations share common ancestry.

Terminology (Roche Genetics)

Terminology

Terminology (Courtesy of Ed Silverman)

Genetic Association Studies Reflection I: In linkage analysis, data from distantly related individuals are more powerful for detecting small effects. There is an increased possibility for linkage to be destroyed by recombination, so linkage extends over smaller distances and denser maps are required.

Linkage Disequilibrium (Roche Genetics)

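Linkage Disequilibrium: Disease locus with alleles D and d (frequencies p_D and p_d); marker locus with alleles 1 and 2 (frequencies p_1 and p_2). Under linkage equilibrium the frequency of the D-1 haplotype is p_D1 = p_D p_1.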
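
A minimal sketch of how departure from p_D1 = p_D p_1 is commonly quantified. It computes the disequilibrium coefficient D = p_D1 - p_D p_1 and the squared correlation r^2 = D^2 / (p_D p_d p_1 p_2). The function name and the example frequencies are hypothetical, and these standard definitions are assumed here rather than taken from the slide.

```python
def ld_measures(p_D1, p_D, p_1):
    """Return (D, r2) for a disease allele D and a marker allele 1.

    p_D1 : frequency of the haplotype carrying both D and 1
    p_D  : frequency of disease allele D (so p_d = 1 - p_D)
    p_1  : frequency of marker allele 1 (so p_2 = 1 - p_1)
    """
    if not (0.0 < p_D < 1.0 and 0.0 < p_1 < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    # D is zero exactly when p_D1 = p_D * p_1 (linkage equilibrium).
    D = p_D1 - p_D * p_1
    # r^2 rescales D^2 by the allele-frequency variances at both loci.
    r2 = D * D / (p_D * (1.0 - p_D) * p_1 * (1.0 - p_1))
    return D, r2


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    print(ld_measures(p_D1=0.10, p_D=0.2, p_1=0.3))
```

With p_D1 equal to p_D p_1 both measures are zero; when the two alleles always travel together and have equal frequency, r^2 reaches 1.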

Genetic Association Studies Reflection II: Association study is a special form of linkage study: the extended family is the wider population. Association studies have greater power than linkage studies to detect small effects, but require looking at more places (Risch and Merikangas 1996).

Genetic Association Studies Reflection III: Genetic susceptibility to common complex disorders involves many genes, most of which have small effects. A large number of “markers” have been identified.

Complex Disorders (Roche Genetics)

Markers

Genetic Association (diagram): DSL = disease susceptibility locus. Direct: test for genetic association between the disease phenotype and the DSL. Indirect: the marker is in LD (correlated) with the DSL; test for association between the phenotype and the marker locus.

Indirect Associations: The polymorphism is a surrogate for the causal locus. Indirect associations are weaker than the direct associations they reflect. Essential to type several surrounding markers. Try to exclude the possibility that a causal variant exists but is not picked up by the marker set: genome-wide vs candidate gene approach.

Statistical Requirements for a Successful Genome-wide Association Study: LD coverage; genotyping quality; sufficient sample sizes; design of genome-wide association studies; handling of the multiple testing problem.

Study Designs (Cordell and Clayton, 2005)

Example for Required Sample Sizes
Required sample sizes to achieve 80% power in a case/control study for a significance level of 10^-7:
Allele freq   OR 1.25   OR 1.5   OR 1.75
0.1           8,859     2,608    1,350
0.2           5,283     1,616    869
0.3           4,281     1,342    727
0.4           3,886     1,301    750

The interpretation of r^2: r^2 N is the “effective sample size”. If a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with r^2 N cases and controls that directly measured G. So the markers that are genotyped should be selected so that they have high r^2 values (preferably at least 80%) with the markers that are not genotyped. Good SNP selection will be key for the success of GWAs.
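
A small sketch of the r^2 N arithmetic, assuming only the relation stated on this slide. The function names are hypothetical. The example figures reuse numbers that appear elsewhere in these slides: 1,616 subjects from the required-sample-size table and the r^2 values 0.2 and 0.56 from the Assessing Association slide.

```python
import math


def effective_sample_size(n_genotyped, r2):
    """Effective sample size r^2 * N when the marker, not the causal variant, is typed."""
    return r2 * n_genotyped


def required_sample_size(n_direct, r2):
    """Subjects needed at the marker to match a direct study of size n_direct."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return math.ceil(n_direct / r2)


def relative_sample_size(r2_weak, r2_strong):
    """How many times larger the study must be for the weaker marker."""
    return r2_strong / r2_weak


if __name__ == "__main__":
    # 1,616 subjects suffice when the causal variant is typed directly
    # (allele frequency 0.2, odds ratio 1.5 in the table above).
    print(required_sample_size(1616, 0.8))
    print(relative_sample_size(0.2, 0.56))
```

The inflation factor is simply 1/r^2, which is why markers with r^2 below about 0.8 quickly become expensive.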

Power – a Statistical Concept

Online Calculators: General Statistical Calculators Including a Power Calculator (UCLA); Statistical Power Calculator for Frequencies; Retrospective Power Calculation; Genetic Power Calculator; Wise Project Applets: Power Applet; Downloadable calculators: CaTS (Skol, 2006), Quanto (sample size or power calculation for association studies of genes, gene-environment or gene-gene interactions); Calculation of Power for Genetic Association Studies 'AssocPow' (Ambrosius, 2004), PS: Power and Sample Size Calculation; Power & Sample Size Calculations on STATA. (http://www.dorak.info/epi/glosge.html)

Type I and Type II errors

Statistical Analysis depends on Study Design... (Cordell and Clayton, 2005)

Statistical analysis depends on …

Assessing Association: Direct association: patterns of genotype-phenotype relationship, from dose-response models to models accounting for epistatic effects. Indirect association: patterns of linkage disequilibrium. r^2 relates to the power to detect association: with locus B assumed to be causal and r^2 measures of LD of 0.2 between A and B and 0.56 between B and C, the sample size must be 0.56/0.2 (2.8) times as large to detect indirect association with A than indirect association with C. Haplotype blocks / haplotype tagging SNPs.

Human Genetic Disorders: Single gene disorders: less than 0.05% (rare), e.g., Huntington disease, cystic fibrosis. Disorders with polygenic or multifactorial inheritance: 1% or more (common), e.g., diabetes, obesity; do not show Mendelian modes of transmission; genetically relevant phenotype often unclear; under the influence of multiple interacting genes.

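Mendelian Traits (figure): pedigree genotyped at Locus 1 (AA, Aa, aa) and Locus 2 (BB, Bb, bb), with a 3 x 3 grid of two-locus genotypes (AABB, AABb, AAbb; AaBB, AaBb, Aabb; aaBB, aaBb, aabb) on which the affected genotypes are marked.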

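Complex Traits (figure): pedigree genotyped at Locus 1 (AA, Aa, aa) and Locus 2 (BB, Bb, bb), with a 3 x 3 grid of two-locus genotypes (AABB, AABb, AAbb; AaBB, AaBb, Aabb; aaBB, aaBb, aabb) on which the affected genotypes are marked.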

Genetic Etiology I: Genetic heterogeneity. Independent effect: any one bad gene (Gene1, Gene2, Gene3, Gene4, Gene5) results in the disease. Genes have no effect on each other.

Genetic Etiology II: Epistasis. Interactive effect: e.g., any bad gene (Gene1, Gene2, Gene3) results in disease. Genes have an effect on other genes in the pathway.

Genetic Etiology III: Incomplete penetrance. Some individuals with the genotype (Gene1) do not manifest the trait: the same gene is shown leading to disease in some carriers and to no disease in others.

Genetic Etiology IV: Phenocopy. Assuming a dominant model with disease allele A and normal allele a, disease in AA and Aa individuals is attributed to the genotype, whereas disease in an aa individual may be caused by environmental factors.

And now we should be able to start modeling, testing, estimating, …

Association Analysis: Case-control studies: test for association between marker alleles and the disease phenotype in a group of affected and unaffected individuals sampled randomly from the population. Family-based studies: test for association between marker alleles and the disease phenotype in a group of affected individuals and unaffected family members.

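Case-control data structure
Status  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  SNP10
1       1     2     2     1     2     1     2     2     1     2
1       0     0     0     1     0     0     0     0     1     0
1       0     2     0     1     1     0     2     0     1     1
1       2     0     1     1     0     2     0     1     1     0
1       2     1     1     0     0     2     1     1     0     0
1       1     0     0     0     0     1     0     0     0     0
1       1     1     0     1     2     1     1     0     1     2
1       1     0     1     0     2     1     0     1     0     2
1       0     0     0     2     0     0     0     0     2     0
1       0     0     1     0     1     0     0     1     0     1
0       2     1     0     1     0     2     1     0     1     0
0       0     1     1     0     0     0     1     1     0     0
0       1     1     0     2     1     1     1     0     2     1
0       0     0     2     0     1     0     0     2     0     1
0       2     1     0     1     1     2     1     0     1     1
0       0     0     2     0     0     0     0     2     0     0
0       1     0     0     1     2     1     0     0     1     2
0       0     1     1     1     2     0     1     1     1     2
0       1     1     0     0     2     1     1     0     0     2
0       0     1     2     0     0     0     1     2     0     0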

Standard Method: Genotype Case-Control
Number of copies of the ‘0’ allele:
          0     1     2     Total
Case      r_0   r_1   r_2   R
Control   s_0   s_1   s_2   S
Total     n_0   n_1   n_2   N
The Bonferroni correction for multiple comparisons: 0.05/(# SNPs tested) (Gibson and Muse, 2002)
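
One way to test the 2 x 3 genotype table is Pearson's chi-square test with 2 degrees of freedom, compared against the Bonferroni threshold 0.05/(# SNPs tested). The slide does not name the test statistic, so the choice of Pearson's chi-square is an assumption, as are the function names and the example counts. For 2 degrees of freedom the chi-square tail probability equals exp(-x/2), so no statistics library is needed.

```python
import math


def genotype_chi2(cases, controls):
    """Pearson chi-square for a 2 x 3 genotype table.

    cases    : (r_0, r_1, r_2) genotype counts among cases
    controls : (s_0, s_1, s_2) genotype counts among controls
    Returns (chi2, p_value) with 2 degrees of freedom.
    """
    if len(cases) != 3 or len(controls) != 3:
        raise ValueError("expected three genotype counts per group")
    R, S = sum(cases), sum(controls)
    if R == 0 or S == 0:
        raise ValueError("need at least one case and one control")
    N = R + S
    chi2 = 0.0
    for r_j, s_j in zip(cases, controls):
        n_j = r_j + s_j
        if n_j == 0:
            raise ValueError("empty genotype column: drop it and use 1 df instead")
        for observed, row_total in ((r_j, R), (s_j, S)):
            expected = row_total * n_j / N
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of a chi-square variable with exactly 2 df.
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


def bonferroni_threshold(n_snps, alpha=0.05):
    """Per-SNP significance level 0.05 / (# SNPs tested)."""
    return alpha / n_snps


if __name__ == "__main__":
    # Hypothetical counts for one SNP out of 10 tested.
    chi2, p = genotype_chi2((10, 20, 20), (20, 20, 10))
    print(chi2, p, p < bonferroni_threshold(10))
```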

A Pure Epistatic Inheritance Model (p = 0.5, q = 0.5)
            AA    Aa    aa    Marginal
BB          0     0     0.2   0.2
Bb          0     0.2   0     0.2
bb          0.2   0     0     0.2
Marginal    0.2   0.2   0.2
Comparison of allele or genotype frequencies between cases and controls will not show anything unusual. Virtually no power!

Traditional Method suffers: A large number of SNPs are genotyped: “multiple comparisons” problem, very small p-values required for significance. Genetic loci may interact (epistasis) in their influence on the phenotype: loci with small marginal effects may go undetected; interested in the interaction itself.

Curse of Dimensionality (figure): N = 100 (50 cases, 50 controls) spread over the genotype cells of SNP 1 (AA, Aa, aa), then SNP 1 x SNP 2 (BB, Bb, bb), then additionally SNP 3 (CC, Cc, cc) and SNP 4 (DD, Dd, dd).
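
The arithmetic behind the figure can be sketched as follows, assuming three genotypes per SNP so that k SNPs define 3^k multi-locus cells. The function name is hypothetical.

```python
def cells_and_mean_count(n_subjects, n_snps, genotypes_per_snp=3):
    """Number of multi-locus genotype cells and average subjects per cell."""
    cells = genotypes_per_snp ** n_snps
    return cells, n_subjects / cells


if __name__ == "__main__":
    # N = 100 subjects as in the figure; one to four SNPs.
    for k in range(1, 5):
        cells, mean_count = cells_and_mean_count(100, k)
        print(k, cells, round(mean_count, 2))
```

With four SNPs there are 81 cells for 100 subjects, so many cells are empty or nearly empty and cell-wise estimates become unstable.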

Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press: “... Multidimensional variational problems cannot be solved routinely.... This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”

Traditional Methods suffer: Alternatives. Tree-based methods: Recursive Partitioning (Helix Tree); Random Forests (R, CART). Pattern recognition methods: Symbolic Discriminant Analysis (SDA); mining association rules; neural networks (NN); support vector machines (SVM). Data reduction methods: DICE (Detection of Informative Combined Effects); MDR (Multifactor Dimensionality Reduction); logic regression; … (e.g., Onkamo and Toivonen 2006)

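Hypothesis Testing (decision tree): Type of data: qualitative (categorical) or quantitative (measurement). Qualitative: 1 variable: goodness-of-fit χ² test; 2 independent variables: χ² independence test; 2 dependent variables: McNemar test. Quantitative, relationships: 1 predictor with continuous measurement: Pearson r if the primary interest is the degree of relationship, regression if it is the form of relationship; 1 predictor with ranks: Spearman r_s; multiple predictors: multiple regression. Quantitative, differences between 2 groups: independent samples: 2-sample t (parametric) or Mann-Whitney U (nonparametric); dependent samples: related-sample t (parametric) or Wilcoxon T (nonparametric). Quantitative, differences between multiple groups: independent samples with 1 IV: one-way ANOVA (parametric) or Kruskal-Wallis H (nonparametric); independent samples with multiple IVs: factorial ANOVA; dependent samples: repeated measures ANOVA (parametric) or Friedman (nonparametric).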

Multi-locus Methods: Parametric methods: regression; logistic or (bagged) logic regression. Non-parametric methods: Combinatorial Partitioning Method (CPM): quantitative phenotypes, interactions; Multifactor-Dimensionality Reduction (MDR): qualitative phenotypes, interactions; machine learning and data mining.

Limitation of Regression: Having too many independent variables in relation to the number of observed outcome events. Assuming 10 bi-allelic loci:
Effect                 # of parameters
Main effect            20
2-locus interaction    180
3-locus interaction    960
4-locus interaction    3360
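
The parameter counts in the table are reproduced by C(10, k) x 2^k, which is what one gets if each bi-allelic locus (three genotypes) is coded with two indicator variables and every k-locus interaction multiplies them out. That coding is an inference from the numbers, not something the slide states, and the function name is hypothetical.

```python
import math


def interaction_parameters(n_loci, order, dummies_per_locus=2):
    """Parameters needed for all interactions of a given order.

    Assumes each locus is coded with `dummies_per_locus` indicator variables,
    so each set of `order` loci contributes dummies_per_locus ** order terms.
    """
    return math.comb(n_loci, order) * dummies_per_locus ** order


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, interaction_parameters(10, k))
```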

Limitation of Regression: Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type I and Type II errors. Number of parameters: P ≤ min(n_case, n_control)/10 - 1. For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 - 1) parameters should be estimated in a logistic regression model.
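
The rule of thumb can be written as a one-line helper. The function name is hypothetical, and rounding down with integer division for sample sizes that are not multiples of 10 is an assumption.

```python
def max_parameters(n_case, n_control, events_per_parameter=10):
    """Upper bound P <= min(n_case, n_control)/10 - 1 on model parameters."""
    return min(n_case, n_control) // events_per_parameter - 1


if __name__ == "__main__":
    limit = max_parameters(200, 200)
    # 10 bi-allelic loci already need 20 main-effect parameters.
    print(limit, 20 > limit)
```

With 200 cases and 200 controls, the 20 main-effect parameters for 10 bi-allelic loci already exceed the limit before any interaction term is added.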

MDR: An extension of CPM; CPM finds the genotype partitions within which a (quantitative) trait variability is much lower than between partitions. MDR reduces the dimensionality of multi-locus information to one dimension, thereby improving the identification of polymorphism combinations associated with disease risk. The one-dimensional multi-locus genotype variable is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing.

Two Measures for Selection of Best n-locus model: Misclassification error: the proportion of incorrect classification in the training set. Prediction error (PE): the proportion of incorrect prediction in the test set.

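MDR Steps: Split the data into 9/10 training data and 1/10 test data. Evaluate all combinations of n factors on the training data (e.g., all combinations of 2 factors out of 10 = 10*9/2 = 45); the single model with minimum classification error is the best model for that split. 10-fold cross-validation, repeated over 10 runs, gives 10 best models per run. The model with minimum PE is the best n-locus model.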
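
The steps above can be sketched as follows. This is a simplified illustration, not the MDR software. Two points are assumptions taken from the MDR literature rather than from this slide: a multi-locus genotype cell is labelled high risk when its case:control ratio in the training data is at least a threshold (1 here), and test subjects whose cell was never seen in training are treated as low risk. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def fit_high_risk_cells(rows, loci, threshold=1.0):
    """Return the set of multi-locus genotype cells labelled high risk.

    rows : list of (status, genotypes) with status 1 = case, 0 = control
    loci : tuple of column indices forming the candidate model
    """
    cases, controls = Counter(), Counter()
    for status, genotypes in rows:
        cell = tuple(genotypes[i] for i in loci)
        (cases if status == 1 else controls)[cell] += 1
    # Assumption: high risk when cases / controls >= threshold.
    return {cell for cell in set(cases) | set(controls)
            if cases[cell] >= threshold * controls[cell]}


def error_rate(rows, loci, high_risk):
    """Proportion of subjects whose status differs from the cell label."""
    wrong = 0
    for status, genotypes in rows:
        predicted = 1 if tuple(genotypes[i] for i in loci) in high_risk else 0
        wrong += predicted != status
    return wrong / len(rows)


def mdr_cross_validate(rows, n_loci, n_factors=2, n_folds=10):
    """For each fold return (best model on training data, its prediction error).

    Rows are assumed to be in random order; folds are taken by striding.
    """
    folds = [rows[k::n_folds] for k in range(n_folds)]
    results = []
    for k, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != k for row in fold]
        # Best model = minimum misclassification error in the training set.
        best = min(
            combinations(range(n_loci), n_factors),
            key=lambda loci: error_rate(train, loci, fit_high_risk_cells(train, loci)),
        )
        high_risk = fit_high_risk_cells(train, best)
        # Prediction error (PE) is measured on the held-out tenth.
        results.append((best, error_rate(test, best, high_risk)))
    return results
```

Calling mdr_cross_validate(rows, n_loci=10) evaluates the 45 two-factor models in each of the 10 training sets and returns one (model, PE) pair per fold.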

Best Multi-factor Models: best 2-factor model, best 3-factor model, best 4-factor model, best 5-factor model, best 6-factor model, …, best n-factor model.

Model Selection and Evaluation: Among the best n-factor models, the best model is: the model with the minimum average PE; the model with the maximum average CVC. Rule of parsimony: if there is a tie, select the smaller model.

MDR Analysis Window (MDR_Overview.pdf)

Significance of the Final Model: Via permutation tests: Randomize the case and control labels in the original dataset multiple times to create a set of permuted datasets. Run MDR on each permuted dataset. The maximum CVC and minimum PE identified for each dataset are saved and used to create an empirical distribution for estimation of a P-value.
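
A generic sketch of the permutation procedure. The `statistic` argument stands for whatever is recomputed on each permuted dataset (for MDR, for example, the maximum CVC, or the negated minimum PE so that larger is better). The function name, the default of 1000 permutations and the add-one correction in the P-value are assumptions, not details from the slide.

```python
import random


def permutation_p_value(statistic, labels, n_permutations=1000, seed=0):
    """Empirical P-value for statistic(labels) under label permutation.

    statistic : callable taking a list of case/control labels and
                returning a number where larger means stronger evidence
    labels    : observed case (1) / control (0) labels
    """
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomize the case and control labels.
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical toy example: a genotype that separates cases perfectly.
    genotype = [0] * 20 + [1] * 20
    labels = [0] * 20 + [1] * 20
    accuracy = lambda lab: sum(l == g for l, g in zip(lab, genotype)) / len(lab)
    print(permutation_p_value(accuracy, labels))
```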

Measures in Selection of Final Model: Cross-validation consistency (CVC): in every run, the number of times the same MDR model is identified in m cross-validations; 1 ≤ CVC ≤ m. Average cross-validation consistency: average of CVC across all runs. Average misclassification error: average across all cross-validations and all runs. Average prediction error: average across all cross-validations and all runs.
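
A short sketch of the CVC bookkeeping, assuming each cross-validation fold reports its best model as a tuple of SNP numbers. The function names are hypothetical.

```python
from collections import Counter


def cross_validation_consistency(best_models):
    """Most frequently selected model in one run and its CVC.

    best_models : one model (tuple of SNP numbers) per cross-validation fold
    """
    counts = Counter(tuple(sorted(model)) for model in best_models)
    model, cvc = counts.most_common(1)[0]
    return model, cvc


def average_cvc(runs):
    """Average CVC across runs; each run is a list of per-fold best models."""
    return sum(cross_validation_consistency(run)[1] for run in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical run: 9 of 10 folds select SNPs 1 and 6.
    run = [(1, 6)] * 9 + [(2, 3)]
    print(cross_validation_consistency(run))
```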

Simulation I: 200 cases and 200 controls; 10 SNPs: 1, 2, 3, …, 10. Disease etiology due to interaction between SNP 1 and SNP 6. Results over 10 CVs and 10 runs.

Simulation II: 50 replicates of 200 cases and 200 controls; 10 SNPs: 1, 2, 3, …, 10. Disease risk is dependent on whether two deleterious alleles and two normal alleles are present, from either one locus or both loci. 2-locus epistasis model; 3-locus epistasis model; 4-locus epistasis model; 5-locus epistasis model.

Mean and standard error of the mean calculated from 50 replicates. Power: 78%, 82%, 94%, 90% (Ritchie et al, 2001)

Power of MDR in Presence of Genotyping Error, Missing Data, Phenocopy, and Genetic Heterogeneity: no noise; 5% genotyping error (GE); 5% missing data (MS); 50% phenocopy (PC); 50% genetic heterogeneity (GH); pairwise combinations GE + MS, … (6 models); three-way combinations GE+MS+PC, … (4 models); GE+MS+PC+GH. Total: 16 models.
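
The count of 16 follows from taking every subset of the four noise sources, as this small enumeration shows. The function name is hypothetical.

```python
from itertools import combinations

NOISE_SOURCES = ("GE", "MS", "PC", "GH")


def noise_models(sources=NOISE_SOURCES):
    """All subsets of the noise sources, from no noise to all four combined."""
    return [combo
            for size in range(len(sources) + 1)
            for combo in combinations(sources, size)]


if __name__ == "__main__":
    for combo in noise_models():
        print("+".join(combo) if combo else "no noise")
```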

Advantages of MDR: Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in absence of main effect. Non-parametric: overcomes the “curse of dimensionality” from which logistic regression models suffer. No particular genetic model. Low false positive rates.

Disadvantages of MDR: Computationally very intensive: only feasible for a relatively small number of factors; impractical to test very high-dimensional models. When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set cannot be predicted. This impacts the SEM of prediction error. Low power in the presence of heterogeneity.

Issues to Consider: I: Variable selection. II: Model selection. III: Interpretation.

I: Variable Selection: How can you determine which variables to select? Not computationally feasible to evaluate all possible combinations. Need to select correct variables to detect interactions.

How many combinations are there? ~500,000 SNPs span 80% of common variation in genome (HapMap)
SNPs in each subset   Number of possible combinations
1                     5 x 10^5
2                     1 x 10^11
3                     2 x 10^16
4                     3 x 10^21
5                     2 x 10^26
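
The orders of magnitude in the table can be checked with exact binomial coefficients, assuming the counts are C(500000, k) for subsets of k SNPs. The function name is hypothetical.

```python
import math


def snp_subsets(n_snps, subset_size):
    """Exact number of distinct SNP subsets of the given size."""
    return math.comb(n_snps, subset_size)


if __name__ == "__main__":
    for k in range(1, 6):
        # Scientific notation makes the growth per added SNP easy to read.
        print(k, f"{snp_subsets(500_000, k):.1e}")
```

Each additional SNP in the subset multiplies the search space by roughly five orders of magnitude, which is why exhaustive evaluation is not feasible genome-wide.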

II: Model Selection: For each variable subset, evaluate a statistical model. Goal is to identify the best subset of variables that compose the best model.

III: Interpretation: Selection of best statistical model in a vast search space of possible models. Statistical or computational model may not translate into biology. May not be able to identify prevention or treatment strategies directly. Wet lab experiments will be necessary, but may not be sufficient.

Interpretation: Strategies to assess biological interpretation of gene-gene interaction models: Consider current knowledge about the biochemistry of the system and the biological plausibility of the models. Perform experiments in the wet lab to measure the effect of small perturbations to the system. Computer simulation algorithms to model biochemical systems.

MDR: To Keep in Mind: Candidate SNP selection: the selection of the final model is highly dependent on the selection of n factors at the beginning. Selection of the best n-factor model: keeping one best n-factor model from all combinations is actually a greedy search algorithm, which might lead to a local maximum; yet power results are good and practice has proven its usefulness. Performance when heterogeneity is present in the data: phenotypic (different clinical expressions), genetic (different inheritance patterns), locus (different genes), allelic (different alleles in the same gene).

References for MDR
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001 Jul;69(1):138-47.
Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003 Feb;24(2):150-7.
Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003 Feb 12;19(3):376-82.
Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56(1-3):73-82.
Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004 Mar;47(3):549-54.
Ritchie MD, Motsinger AA. Multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions in pharmacogenomics studies. Pharmacogenomics. 2005 Dec;6(8):823-34.
Martin ER, Ritchie MD, Hahn L, Kang S, Moore JH. A novel method to identify gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol. 2006 Feb;30(2):111-23.
Andrew AS, Nelson HH, Kelsey KT, Moore JH, Meng AC, Casella DP, Tosteson TD, Schned AR, Karagas MR. Concordance of multiple analytical approaches demonstrates a complex relationship between DNA repair gene SNPs, smoking and bladder cancer susceptibility. Carcinogenesis. 2006 May;27(5):1030-7.
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006 Jul 21;241(2):252-61.

Acknowledgements Slides content based on material from Jie Chen, Frank Emmert-Streib, Earl F Glynn, Hua Li, Bolan Linghu, Arcady R Mushegian, Yan Meng, Jurg Ott, Marylyn Ritchie, Antonio Salas, Chris Seidel, Matt McQueen, Christoph Lange and discussions with Steve Horvath, Nan M. Laird, Stephen Lake, Christoph Lange, Ross Lazarus, Matthew McQueen, Benjamin Raby, Nuria Malats, Marylyn Ritchie (lab), Edwin K. Silverman, Scott T. Weiss, Xin Xu, …
