1
Parametric versus Non-parametric Genetic Association Analysis
Kristel Van Steen, PhD, ScD (kristel.vansteen@ugent.be)
Université de Liège - Institut Montefiore; Ghent University - StepGen cvba
December 18th, 2007
2
Genetic Association Studies
Aim: detect association between one or more genetic polymorphisms and a trait, which may be
- measured,
- dichotomous,
- time to onset.
(Genuine) genetic associations arise only because human populations share common ancestry.
3
Terminology (Roche Genetics)
4
Terminology
5
Terminology (Courtesy of Ed Silverman)
6
Genetic Association Studies
Reflection I:
- In linkage analysis, data from distantly related individuals are more powerful for detecting small effects.
- Increased possibility for linkage to be destroyed by recombination: linkage extends over smaller distances, so denser maps are required.
7
Linkage Disequilibrium (Roche Genetics)
8
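Linkage Disequilibrium
Disease locus: alleles D and d, with frequencies $p_D$ and $p_d$.
Marker locus: alleles 1 and 2, with frequencies $p_1$ and $p_2$.
$p_{D1} = p_D\,p_1$: the frequency of the D-1 haplotype equals the product of the allele frequencies when the two loci are in linkage equilibrium.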
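As a minimal sketch of how departure from $p_{D1} = p_D\,p_1$ can be quantified, the function below computes the standard measures D, |D'| and r^2 from the D-1 haplotype frequency and the two allele frequencies. The function name and argument names are illustrative choices, not part of the slides.

```python
def ld_measures(p_D1: float, p_D: float, p_1: float) -> dict:
    """Return D, |D'| and r^2 for two biallelic loci.

    p_D1: frequency of the haplotype carrying alleles D and 1
    p_D:  frequency of allele D at the disease locus
    p_1:  frequency of allele 1 at the marker locus
    """
    for p in (p_D, p_1):
        if not 0.0 < p < 1.0:
            raise ValueError("both loci must be polymorphic (0 < p < 1)")
    # D is the deviation of the haplotype frequency from its
    # linkage-equilibrium expectation p_D * p_1.
    D = p_D1 - p_D * p_1
    # d_max is the largest |D| compatible with the allele frequencies.
    if D >= 0:
        d_max = min(p_D * (1 - p_1), (1 - p_D) * p_1)
    else:
        d_max = min(p_D * p_1, (1 - p_D) * (1 - p_1))
    d_prime = abs(D) / d_max
    # r^2 is the squared correlation between the two loci.
    r2 = D ** 2 / (p_D * (1 - p_D) * p_1 * (1 - p_1))
    return {"D": D, "D_prime": d_prime, "r2": r2}


if __name__ == "__main__":
    # Equilibrium: the haplotype frequency is exactly the product.
    print(ld_measures(p_D1=0.06, p_D=0.2, p_1=0.3))
    # Complete LD: allele D always travels with allele 1.
    print(ld_measures(p_D1=0.3, p_D=0.3, p_1=0.3))
```

Under equilibrium all three measures are zero (up to rounding); when D always occurs together with allele 1 and both alleles have the same frequency, |D'| and r^2 both equal 1.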
9
Genetic Association Studies
Reflection II:
- An association study is a special form of linkage study: the extended family is the wider population.
- Association studies have greater power than linkage studies to detect small effects, but require looking at more places (Risch and Merikangas 1996).
10
Genetic Association Studies
Reflection III:
- Genetic susceptibility to common complex disorders involves many genes, most of which have small effects.
- A large number of “markers” have been identified.
11
Complex Disorders (Roche Genetics)
12
Markers
13
Genetic Association
DSL: disease susceptibility locus.
- Direct: test for genetic association between the disease phenotype and the DSL.
- Indirect: the marker is in LD (correlated) with the DSL; test for association between the phenotype and the marker locus.
14
Indirect Associations
- The polymorphism is a surrogate for the causal locus.
- Indirect associations are weaker than the direct associations they reflect.
- Essential to type several surrounding markers.
- Try to exclude the possibility that a causal variant exists but is not picked up by the marker set.
- Genome-wide vs candidate gene approach.
15
Statistical Requirements for a Successful Genome-wide Association Study
- LD coverage
- Genotyping quality
- Sufficient sample sizes
- Design of genome-wide association studies
- Handling of the multiple testing problem
16
Study Designs (Cordell and Clayton, 2005)
17
Example for Required Sample Sizes
Required sample sizes to achieve 80% power in a case/control study for a significance level of 10^-7:

Allele freq    Odds ratio 1.25    Odds ratio 1.5    Odds ratio 1.75
0.1            8,859              2,608             1,350
0.2            5,283              1,616             869
0.3            4,281              1,342             727
0.4            3,886              1,301             750
18
The interpretation of r^2
- r^2 N is the “effective sample size”.
- If a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with r^2 N cases and controls that directly measured G.
- So the markers that are genotyped should be selected so that they have high r^2 values (preferably at least 80%) with the markers that are not genotyped.
- A good SNP selection will be key for the success of GWAs.
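The r^2 N rule can be turned into simple arithmetic. The sketch below is only an illustration of that rule; the function names and the example numbers are hypothetical and do not come from the slides.

```python
import math


def effective_sample_size(n: int, r2: float) -> float:
    """Sample size of a direct study of G with the same power as a
    study of n subjects that genotypes marker M only (r^2 * N)."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return r2 * n


def required_marker_sample_size(n_direct: int, r2: float) -> int:
    """Subjects needed when typing M to match a direct study of
    n_direct subjects: N = n_direct / r^2, rounded up."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return math.ceil(n_direct / r2)


if __name__ == "__main__":
    # Hypothetical numbers: a marker with r^2 = 0.5 to the causal variant.
    print(effective_sample_size(2000, 0.5))
    print(required_marker_sample_size(1000, 0.5))
```

Halving r^2 doubles the number of subjects needed, which is why a high r^2 between typed and untyped variants matters so much for marker selection.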
19
Power – a Statistical Concept
20
Online Calculators
- General Statistical Calculators Including a Power Calculator (UCLA);
- Statistical Power Calculator for Frequencies;
- Retrospective Power Calculation;
- Genetic Power Calculator;
- Wise Project Applets: Power Applet;
- Downloadable calculators: CaTS (Skol, 2006), Quanto (sample size or power calculation for association studies of genes, gene-environment or gene-gene interactions);
- Calculation of Power for Genetic Association Studies 'AssocPow' (Ambrosius, 2004), PS: Power and Sample Size Calculation;
- Power & Sample Size Calculations on STATA.
(http://www.dorak.info/epi/glosge.html)
21
Type I and Type II errors
22
Statistical Analysis depends on Study Design... (Cordell and Clayton, 2005)
23
Statistical analysis depends on …
24
Assessing Association
- Direct association: patterns of genotype-phenotype relationship. From dose-response models to models accounting for epistatic effects.
- Indirect association: patterns of linkage disequilibrium.
  - r^2 relates to the power to detect association: the sample size must be 0.56/0.2 (2.8) times as large to detect indirect association with A than indirect association with C.
  - Haplotype blocks / haplotype tagging SNPs.
[Figure: r^2 measures of LD among loci A, B and C; locus B is assumed to be causal. r^2(A, B) = 0.2; r^2(B, C) = 0.56.]
25
Human Genetic Disorders
- Single gene disorders: less than 0.05% (rare), e.g., Huntington disease, cystic fibrosis.
- Disorders with polygenic or multifactorial inheritance: 1% or more (common), e.g., diabetes, obesity.
  - Do not show Mendelian modes of transmission.
  - Genetically relevant phenotype often unclear.
  - Under the influence of multiple interacting genes.
26
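Mendelian Traits
[Figure: pedigree genotyped at Locus 1 (AA, Aa, aa) and Locus 2 (BB, Bb, bb), with a 3 x 3 grid of two-locus genotypes (AABB, AABb, AAbb; AaBB, AaBb, Aabb; aaBB, aaBb, aabb) marking which are affected.]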
27
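Complex Traits
[Figure: pedigree genotyped at Locus 1 (AA, Aa, aa) and Locus 2 (BB, Bb, bb), with a 3 x 3 grid of two-locus genotypes (AABB, AABb, AAbb; AaBB, AaBb, Aabb; aaBB, aaBb, aabb) marking which are affected.]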
28
Genetic Etiology I: Genetic Heterogeneity
Independent effect: any one bad gene (Gene1, Gene2, Gene3, Gene4, Gene5) results in the disease. Genes have no effect on each other.
29
Genetic Etiology II: Epistasis
Interactive effect: e.g., any bad gene (Gene1, Gene2, Gene3) results in disease. Genes have an effect on other genes in the pathway.
30
Genetic Etiology III: Incomplete Penetrance
Some individuals with the genotype do not manifest the trait.
[Figure: carriers of Gene1 shown with outcomes "Disease" and "No Disease".]
31
Genetic Etiology IV: Phenocopy
Assuming a dominant model, with disease allele A and normal allele a: disease may be caused by environmental factors.
[Figure: individuals with genotypes aa, Aa and AA, all with disease.]
32
And now we should be able to start modeling, testing, estimating, …
33
Association Analysis
- Case-control studies: test for association between marker alleles and the disease phenotype in a group of affected and unaffected individuals randomly sampled from the population.
- Family-based studies: test for association between marker alleles and the disease phenotype in a group of affected individuals and unaffected family members.
34
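Case-control data structure

Status  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  SNP10
1       1     2     2     1     2     1     2     2     1     2
1       0     0     0     1     0     0     0     0     1     0
1       0     2     0     1     1     0     2     0     1     1
1       2     0     1     1     0     2     0     1     1     0
1       2     1     1     0     0     2     1     1     0     0
1       1     0     0     0     0     1     0     0     0     0
1       1     1     0     1     2     1     1     0     1     2
1       1     0     1     0     2     1     0     1     0     2
1       0     0     0     2     0     0     0     0     2     0
1       0     0     1     0     1     0     0     1     0     1
0       2     1     0     1     0     2     1     0     1     0
0       0     1     1     0     0     0     1     1     0     0
0       1     1     0     2     1     1     1     0     2     1
0       0     0     2     0     1     0     0     2     0     1
0       2     1     0     1     1     2     1     0     1     1
0       0     0     2     0     0     0     0     2     0     0
0       1     0     0     1     2     1     0     0     1     2
0       0     1     1     1     2     0     1     1     1     2
0       1     1     0     0     2     1     1     0     0     2
0       0     1     2     0     0     0     1     2     0     0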
35
Standard Method: Genotype Case-Control

           # copies of the ‘0’ allele
           0      1      2      Total
Case       r_0    r_1    r_2    R
Control    s_0    s_1    s_2    S
Total      n_0    n_1    n_2    N

The Bonferroni correction for multiple comparisons: 0.05/(# SNPs tested) (Gibson and Muse, 2002)
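One common way to analyse such a 2 x 3 genotype table is a Pearson chi-square test with 2 degrees of freedom, compared against the Bonferroni-corrected threshold. The slide does not name the test statistic, so the choice of Pearson's chi-square here is an assumption, as are the function names and the example counts.

```python
import math


def genotype_chi2(cases, controls):
    """Pearson chi-square test on a 2 x 3 genotype table.

    cases = (r0, r1, r2), controls = (s0, s1, s2).
    Returns (statistic, p_value) on 2 degrees of freedom.
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    stat = 0.0
    for r, s in zip(cases, controls):
        n = r + s  # column total n_j
        if n == 0:
            raise ValueError("every genotype must be observed at least once")
        exp_case = R * n / N     # expected case count under no association
        exp_control = S * n / N  # expected control count under no association
        stat += (r - exp_case) ** 2 / exp_case
        stat += (s - exp_control) ** 2 / exp_control
    # For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    return stat, math.exp(-stat / 2.0)


def bonferroni_threshold(n_snps: int, alpha: float = 0.05) -> float:
    """Per-SNP significance level: alpha / (# SNPs tested)."""
    return alpha / n_snps


if __name__ == "__main__":
    # Hypothetical counts for one SNP out of 10 tested.
    stat, p = genotype_chi2(cases=(10, 20, 30), controls=(20, 20, 20))
    print(stat, p, p < bonferroni_threshold(10))
```

The closed-form P-value works only because the table has exactly 2 degrees of freedom; other table shapes would need a general chi-square distribution function.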
36
A Pure Epistatic Inheritance Model
p = 0.5, q = 0.5

            AA     Aa     aa     Marginal
BB          0      0      0.2    0.2
Bb          0      0.2    0      0.2
bb          0.2    0      0      0.2
Marginal    0.2    0.2    0.2

Comparison of allele or genotype frequencies between cases and controls will not show anything unusual. Virtually no power!
37
Traditional Method Suffers
- A large number of SNPs are genotyped: “multiple comparisons” problem, very small p-values required for significance.
- Genetic loci may interact (epistasis) in their influence on the phenotype:
  - loci with small marginal effects may go undetected;
  - interested in the interaction itself.
38
Curse of Dimensionality
N = 100: 50 cases, 50 controls.
[Figure: the sample is split into genotype cells as SNPs are added: SNP 1 (AA, Aa, aa), SNP 2 (BB, Bb, bb), SNP 3 (CC, Cc, cc), SNP 4 (DD, Dd, dd).]
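The sparsity behind the curse of dimensionality can be made concrete with a short calculation: each bi-allelic SNP has 3 genotypes, so k SNPs define 3^k multi-locus cells. The sketch below uses the slide's N = 100; the function names are illustrative.

```python
def genotype_cells(n_snps: int) -> int:
    """Number of multi-locus genotype cells for n_snps bi-allelic SNPs."""
    return 3 ** n_snps


def mean_subjects_per_cell(n_subjects: int, n_snps: int) -> float:
    """Average number of subjects per cell if they were spread evenly."""
    return n_subjects / genotype_cells(n_snps)


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, genotype_cells(k), round(mean_subjects_per_cell(100, k), 2))
```

With four SNPs there are already 81 cells for 100 subjects, so many cells hold one subject or none, and cell-wise estimates become unstable.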
39
Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press: “... Multidimensional variational problems cannot be solved routinely.... This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”
40
Traditional Methods Suffer: Alternatives
- Tree-based methods:
  - Recursive Partitioning (Helix Tree)
  - Random Forests (R, CART)
- Pattern recognition methods:
  - Symbolic Discriminant Analysis (SDA)
  - Mining association rules
  - Neural networks (NN)
  - Support vector machines (SVM)
- Data reduction methods:
  - DICE (Detection of Informative Combined Effects)
  - MDR (Multifactor Dimensionality Reduction)
  - Logic regression
  - …
(e.g., Onkamo and Toivonen 2006)
41
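Hypothesis Testing
[Figure: decision tree for choosing a test by type of data, with parametric and nonparametric alternatives.]
- Qualitative (categorical) data: 1 independent variable: goodness of fit χ²; 2 independent variables: independence test χ²; 2 dependent variables: McNemar test.
- Quantitative (measurement) data, relationships:
  - 1 predictor, continuous measurement: Pearson r (primary interest: degree of relationship) or regression (primary interest: form of relationship);
  - 1 predictor, ranks: Spearman r_s;
  - multiple predictors: multiple regression.
- Quantitative (measurement) data, differences between 2 groups:
  - independent: 2-sample t (parametric), Mann-Whitney U (nonparametric);
  - dependent: related sample t (parametric), Wilcoxon T (nonparametric).
- Quantitative (measurement) data, differences between multiple groups:
  - independent, 1 IV: one-way ANOVA (parametric), Kruskal-Wallis H (nonparametric);
  - independent, multiple IVs: factorial ANOVA;
  - dependent: repeated measures ANOVA (parametric), Friedman (nonparametric).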
42
Multi-locus Methods
- Parametric methods:
  - Regression
  - Logistic or (bagged) logic regression
- Non-parametric methods:
  - Combinatorial Partitioning Method (CPM): quantitative phenotypes; interactions
  - Multifactor-Dimensionality Reduction (MDR): qualitative phenotypes; interactions
  - Machine learning and data mining
43
Limitation of Regression
Having too many independent variables in relation to the number of observed outcome events.
Assuming 10 bi-allelic loci, the number of parameters for k-locus terms is $\binom{10}{k}\,2^k$:

Term                     # of parameters
Main effect              20
2-locus interaction      180
3-locus interaction      960
4-locus interaction      3360
44
Limitation of Regression
Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type 1 and Type 2 errors.
# of parameters: $P \le \min(n_{\text{case}}, n_{\text{control}})/10 - 1$
For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 - 1) parameters should be estimated in a logistic regression model.
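The parameter counts for 10 bi-allelic loci and the events-per-variable rule can be checked with a few lines. The counting formula C(n, k) * 2^k used below is a reading that reproduces the slide's figures (20, 180, 960, 3360); the function names are illustrative.

```python
from math import comb


def n_parameters(n_loci: int, order: int) -> int:
    """Parameters for all order-way terms among n_loci bi-allelic loci.

    Each locus has 3 genotypes, hence 2 degrees of freedom, so one
    k-locus term needs 2**k parameters and there are C(n_loci, k) terms.
    """
    return comb(n_loci, order) * 2 ** order


def max_parameters(n_cases: int, n_controls: int, events_per_parameter: int = 10) -> int:
    """Upper bound P <= min(n_case, n_control) / 10 - 1."""
    return min(n_cases, n_controls) // events_per_parameter - 1


if __name__ == "__main__":
    limit = max_parameters(200, 200)
    for k in range(1, 5):
        needed = n_parameters(10, k)
        print(f"{k}-locus terms: {needed} parameters (limit {limit})")
```

With 200 cases and 200 controls, even the main effects alone (20 parameters) already exceed the bound of 19.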
45
MDR
- An extension of CPM, which finds the genotype partitions within which a (quantitative) trait variability is much lower than between partitions.
- MDR reduces the dimensionality of multi-locus information to one dimension, thereby improving the identification of polymorphism combinations associated with disease risk.
- The one-dimensional multi-locus genotype variable is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing.
46
Two Measures for Selection of Best n-locus Model
- Misclassification error: the proportion of incorrect classification in the training set.
- Prediction error (PE): the proportion of incorrect prediction in the test set.
47
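MDR Steps
- Form all combinations of n factors, e.g., all combinations of 2 factors = 10*9/2 = 45.
- On the training data (9/10 of the data), the single model with minimum classification error is the best model.
- Evaluate that model on the test data (1/10 of the data).
- 10 runs of 10-fold cross-validation: each cross-validation yields 10 best models. The model with minimum PE is the best n-locus model.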
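The steps above can be sketched in plain Python. This is a simplified illustration, not the MDR software itself: it performs a single run, labels a genotype cell high-risk when its case:control count ratio reaches a threshold of 1 (an assumption suited to balanced data), and predicts "control" for cells unseen in training (also an assumption). All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def cell_of(row, snps):
    """Multi-locus genotype of one individual at the chosen SNP indices."""
    return tuple(row[j] for j in snps)


def high_risk_cells(rows, labels, snps, threshold=1.0):
    """Cells whose case:control ratio reaches the threshold are high-risk."""
    cases, controls = Counter(), Counter()
    for row, y in zip(rows, labels):
        (cases if y == 1 else controls)[cell_of(row, snps)] += 1
    return {c for c in set(cases) | set(controls)
            if cases[c] >= threshold * controls[c]}


def error_rate(rows, labels, snps, high):
    """Proportion misclassified when high-risk cells predict 'case'."""
    wrong = sum((cell_of(row, snps) in high) != (y == 1)
                for row, y in zip(rows, labels))
    return wrong / len(labels)


def mdr(rows, labels, order=2, folds=10, seed=0):
    """One run of k-fold cross-validated MDR for models of a given order.

    Returns ((best_model, prediction_error), per_fold_results).
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    results = []
    for f in range(folds):
        test = set(idx[f::folds])  # 1/folds of the data held out
        tr_rows = [rows[i] for i in idx if i not in test]
        tr_y = [labels[i] for i in idx if i not in test]
        te_rows = [rows[i] for i in test]
        te_y = [labels[i] for i in test]

        def train_error(snps):
            high = high_risk_cells(tr_rows, tr_y, snps)
            return error_rate(tr_rows, tr_y, snps, high)

        # Best model of this fold: minimum classification error in training.
        best = min(combinations(range(len(rows[0])), order), key=train_error)
        high = high_risk_cells(tr_rows, tr_y, best)
        results.append((best, error_rate(te_rows, te_y, best, high)))
    # Among the per-fold best models, keep the one with minimum PE.
    return min(results, key=lambda r: r[1]), results
```

A usage example on simulated data: with genotypes coded 0/1/2 in a list of rows and a status list of 0/1, `mdr(rows, labels, order=2)` returns the selected pair of SNP indices with its prediction error, plus the best model and PE of every fold.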
48
Best Multi-factor Models
Best 2-factor model, best 3-factor model, best 4-factor model, best 5-factor model, best 6-factor model, ..., best n-factor model.
49
Model Selection and Evaluation
Among the best n-factor models, the best model is:
- the model with the minimum average PE;
- the model with the maximum average CVC.
Rule of parsimony: if there is a tie, select the smaller model.
50
MDR Analysis Window (MDR_Overview.pdf)
51
Significance of the Final Model
Via permutation tests:
- Randomize the case and control labels in the original dataset multiple times to create a set of permuted datasets.
- Run MDR on each permuted dataset.
- The maximum CVC and minimum PE identified for each dataset are saved and used to create an empirical distribution for estimation of a P-value.
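The permutation procedure can be sketched generically for any statistic where smaller is better, such as the minimum prediction error. The `statistic` callable stands in for a full MDR run on relabelled data, and the add-one correction in the P-value is a common convention assumed here rather than something stated on the slide.

```python
import random


def permutation_p_value(statistic, labels, n_perm=1000, seed=0):
    """Empirical P-value for a statistic where smaller means a better fit.

    statistic: callable taking a list of case/control labels and
               returning a number (e.g. the minimum PE of an MDR run).
    """
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomize case and control labels
        if statistic(shuffled) <= observed:
            as_good += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (1 + as_good) / (1 + n_perm)


if __name__ == "__main__":
    # Toy statistic: error of a fixed classifier that matches the true labels.
    truth = [1] * 20 + [0] * 20

    def mismatch(lbls):
        return sum(a != b for a, b in zip(lbls, truth)) / len(truth)

    print(permutation_p_value(mismatch, truth, n_perm=199))
```

For a statistic where larger is better, such as CVC, the comparison would be reversed to `>=`.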
52
Measures in Selection of Final Model
- Cross-validation consistency (CVC): in every run, the number of times the same MDR model is identified in m cross-validations; 1 ≤ CVC ≤ m.
- Average cross-validation consistency: average of CVC across all runs.
- Average misclassification error: average across all cross-validations and all runs.
- Average prediction error: average across all cross-validations and all runs.
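A small sketch of how these summaries could be computed from stored cross-validation output. It assumes each run is recorded as a list of (model, prediction error) pairs, one per cross-validation interval; this data layout and the function names are hypothetical.

```python
from collections import Counter


def cross_validation_consistency(run):
    """Most frequent model in one run and how often it was selected (CVC)."""
    counts = Counter(model for model, _ in run)
    return counts.most_common(1)[0]


def average_cvc(runs, model):
    """Average, across runs, of the number of folds that selected `model`."""
    per_run = [sum(1 for m, _ in run if m == model) for run in runs]
    return sum(per_run) / len(runs)


def average_prediction_error(runs, model):
    """Average PE of `model` across all cross-validations and all runs."""
    errors = [pe for run in runs for m, pe in run if m == model]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Two hypothetical runs of 10-fold cross-validation.
    run1 = [((1, 6), 0.25)] * 8 + [((1, 3), 0.5)] * 2
    run2 = [((1, 6), 0.25)] * 10
    print(cross_validation_consistency(run1))
    print(average_cvc([run1, run2], (1, 6)))
    print(average_prediction_error([run1, run2], (1, 6)))
```

Models are represented as tuples of SNP labels so they can be counted directly.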
53
Simulation I
200 cases and 200 controls; 10 SNPs: 1, 2, 3, …, 10. Disease etiology due to interaction between SNP 1 and SNP 6. Results over 10 CVs and 10 runs.
54
Simulation II
50 replicates of 200 cases and 200 controls; 10 SNPs: 1, 2, 3, …, 10. Disease risk is dependent on whether two deleterious alleles and two normal alleles are present, from either one locus or both loci.
Models: 2-locus epistasis model; 3-locus epistasis model; 4-locus epistasis model; 5-locus epistasis model.
55
Mean and standard error of the mean calculated from 50 replicates. Power: 78%, 82%, 94%, 90% (Ritchie et al, 2001)
56
Power of MDR in Presence of Genotyping Error, Missing Data, Phenocopy, and Genetic Heterogeneity
- No noise
- 5% genotyping error (GE)
- 5% missing data (MS)
- 50% phenocopy (PC)
- 50% genetic heterogeneity (GH)
- GE + MS, … (6 models)
- GE + MS + PC, … (4 models)
- GE + MS + PC + GH
Total: 16 models
59
Advantages of MDR
- Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in the absence of main effects.
- Non-parametric: overcomes the “curse of dimensionality” from which logistic regression models suffer.
- No particular genetic model is assumed.
- Low false positive rates.
60
Disadvantages of MDR
- Computationally very intensive. Only feasible for a relatively small number of factors. Impractical to test very high-dimensional models.
- When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set cannot be predicted. This impacts the SEM of the prediction error.
- Low power in the presence of heterogeneity.
61
Issues to Consider
- I: Variable selection
- II: Model selection
- III: Interpretation
62
I: Variable Selection
- How can you determine which variables to select?
- Not computationally feasible to evaluate all possible combinations.
- Need to select the correct variables to detect interactions.
63
How many combinations are there?
~500,000 SNPs span 80% of common variation in genome (HapMap)

SNPs in each subset    Number of possible combinations
1                      5 x 10^5
2                      1 x 10^11
3                      2 x 10^16
4                      3 x 10^21
5                      2 x 10^26
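The size of the search space follows from the binomial coefficient C(500,000, k). The short sketch below computes it exactly with Python's standard library; the function name is an illustrative choice.

```python
from math import comb

N_SNPS = 500_000  # approximate number of SNPs quoted on the slide


def n_subsets(n_snps: int, k: int) -> int:
    """Number of distinct k-SNP subsets among n_snps SNPs."""
    return comb(n_snps, k)


if __name__ == "__main__":
    for k in range(1, 6):
        # Scientific notation shows the order of magnitude.
        print(k, f"{n_subsets(N_SNPS, k):.1e}")
```

Each additional SNP in the subset multiplies the count by roughly 10^5, which is why exhaustive evaluation beyond pairs is impractical at genome-wide scale.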
64
II: Model Selection
- For each variable subset, evaluate a statistical model.
- Goal is to identify the best subset of variables that compose the best model.
65
III: Interpretation
- Selection of the best statistical model in a vast search space of possible models.
- A statistical or computational model may not translate into biology.
- May not be able to identify prevention or treatment strategies directly.
- Wet lab experiments will be necessary, but may not be sufficient.
66
Interpretation
Strategies to assess biological interpretation of gene-gene interaction models:
- Consider current knowledge about the biochemistry of the system and the biological plausibility of the models.
- Perform experiments in the wet lab to measure the effect of small perturbations to the system.
- Use computer simulation algorithms to model biochemical systems.
67
MDR: To Keep in Mind
- Candidate SNP selection: the selection of the final model is highly dependent on the selection of the n factors at the beginning.
- Selection of the best n-factor model: keeping one best n-factor model from all combinations is actually a greedy search algorithm, which might lead to a local maximum; yet the power results are good and practice has proven its usefulness.
- Performance when heterogeneity is present in the data: phenotypic (different clinical expressions), genetic (different inheritance patterns), locus (different genes), allelic (different alleles in the same gene).
68
References for MDR
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001 Jul;69(1):138-47.
- Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003 Feb;24(2):150-7.
- Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003 Feb 12;19(3):376-82.
- Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56(1-3):73-82.
- Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004 Mar;47(3):549-54.
- Ritchie MD, Motsinger AA. Multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions in pharmacogenomics studies. Pharmacogenomics. 2005 Dec;6(8):823-34.
- Martin ER, Ritchie MD, Hahn L, Kang S, Moore JH. A novel method to identify gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol. 2006 Feb;30(2):111-23.
- Andrew AS, Nelson HH, Kelsey KT, Moore JH, Meng AC, Casella DP, Tosteson TD, Schned AR, Karagas MR. Concordance of multiple analytical approaches demonstrates a complex relationship between DNA repair gene SNPs, smoking and bladder cancer susceptibility. Carcinogenesis. 2006 May;27(5):1030-7.
- Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006 Jul 21;241(2):252-61.
69
Acknowledgements Slides content based on material from Jie Chen, Frank Emmert-Streib, Earl F Glynn, Hua Li, Bolan Linghu, Arcady R Mushegian, Yan Meng, Jurg Ott, Marylyn Ritchie, Antonio Salas, Chris Seidel, Matt McQueen, Christoph Lange and discussions with Steve Horvath, Nan M. Laird, Stephen Lake, Christoph Lange, Ross Lazarus, Matthew McQueen, Benjamin Raby, Nuria Malats, Marylyn Ritchie (lab), Edwin K. Silverman, Scott T. Weiss, Xin Xu, …