Statistical methods for genetic association studies http://www.stats.gla.ac.uk/~paulj/assoc_study_stats.ppt
A tutorial on statistical methods for population association studies. David Balding, Nature Reviews Genetics (2006) 7:781-791. This talk is based on a review by David Balding from Imperial College London. It covers only one kind of association study, a population based
Diagram: Genetics, Environment, G×E interaction, Health outcome. We want to know why two people with the same environmental exposure differ in their susceptibility to disease, particularly for common complex diseases: heart disease, diabetes, etc. California: cholesterol levels 50-90%. Scandinavia: mortality due to heart disease 50-60%. So we look at the DNA. We might be able to genotype subjects for one strong candidate mutation, but usually we will have little or no idea what’s going on. This is the approach I’m going to talk about today.
Recombination. Diagram labels: X/x: unobserved causative mutation; A/a: distant marker; B/b: linked marker; gametocytes (gamete-producing cells); gametes; recombination. To understand association studies it is crucial to understand the process of recombination. If you look in almost any cell of your body you’ll find two sets of chromosomes, 23 from each parent. When we produce our own germ cells, sperm or eggs, each cell has just one set. This process involves recombination, which is crucial because it breaks down statistical association between markers.
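One standard way to put a number on this breakdown, not shown on the slide, is the textbook result that the linkage disequilibrium D between two loci shrinks by a factor (1 - r) per generation of random mating, where r is the recombination fraction. A minimal sketch in Python; the function name ld_after and the starting values are illustrative assumptions, not taken from the talk:

```python
def ld_after(d0, r, generations):
    """Linkage disequilibrium left after some generations of random mating.

    d0          -- starting disequilibrium D between the two loci
    r           -- recombination fraction (0.5 for unlinked or distant loci)
    generations -- number of generations elapsed
    Uses the standard decay D_t = D_0 * (1 - r) ** t.
    """
    return d0 * (1.0 - r) ** generations


# Illustrative values only: a tightly linked marker (like B/b) keeps most of
# its association with X/x, a distant marker (like A/a) loses it quickly.
linked = ld_after(0.25, 0.01, 10)
distant = ld_after(0.25, 0.5, 10)
```

This is why a marker close to the causative mutation can stand in for it, while a distant marker cannot.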
Approaches to finding disease genes: population-based association study (“unrelated” subjects); family-based association study (nuclear families); admixture mapping (recently admixed population); linkage mapping (large pedigrees). Darvasi & Shifman (2005) Nature Genetics.
Types of population association study: candidate causative polymorphism (SNP (single nucleotide polymorphism), deletion, duplication); candidate causative gene (5-50 marker SNPs; evidence from linkage study or function); candidate causative region (100s of marker SNPs; evidence from linkage study); genome-wide (>300,000 marker SNPs; no prior evidence required).
Common disease common variant (CDCV) hypothesis
Preliminary analysis: data quality. Assuming mating is random and the population is large, Hardy-Weinberg equilibrium (HWE) genotype frequencies will apply. Allele frequencies: P(X) = p, P(x) = q. HWE genotype frequencies: P(XX) = p², P(Xx) = 2pq, P(xx) = q². Useful data quality check: chi-squared or exact test; log QQ plot. But filtering on HWE can discard causative mutations.
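The chi-squared version of this check can be sketched in a few lines of Python. This is a minimal illustration, assuming both alleles are observed in the sample; the function name hwe_chi2 is made up for the example. The allele frequency p is estimated from the genotype counts, so the test has 1 degree of freedom.

```python
import math


def hwe_chi2(n_XX, n_Xx, n_xx):
    """Chi-squared test of Hardy-Weinberg equilibrium from genotype counts.

    Returns (chi2, p_value). Assumes both alleles occur in the sample,
    otherwise an expected count is zero and the statistic is undefined.
    """
    n = n_XX + n_Xx + n_xx
    p = (2 * n_XX + n_Xx) / (2 * n)   # estimated frequency of allele X
    q = 1.0 - p                       # estimated frequency of allele x
    observed = [n_XX, n_Xx, n_xx]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For small samples or rare alleles the exact test mentioned on the slide is usually preferred over this approximation.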
Log QQ plot
Preliminary analysis: dealing with missing data. Imputation, various methods: maximum likelihood; probabilistic; ‘hot-deck’; regression modelling. Test for independence of ‘missingness’ and case-control status.
Choice of inheritance model Snapdragons Antirrhinum majus
Tests of association: single SNP. Case-control. Treat genotype as factor with 3 levels, perform 2x3 goodness-of-fit test: loses power if effect is additive. Count alleles rather than individuals, perform 2x2 goodness-of-fit test: out of favour because sensitive to deviation from HWE and risk estimates not interpretable. Table (counts not given): rows Case, Control; columns Major allele homozygote (0), Heterozygote (1), Minor allele homozygote (2).
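Both tests on this slide come down to a Pearson chi-squared statistic on a table of counts, once with genotypes as columns (2x3, 2 degrees of freedom) and once with alleles as columns (2x2, 1 degree of freedom). A minimal sketch in Python; the function names are illustrative assumptions, and genotype counts are given in the order (0, 1, 2) copies of the minor allele:

```python
import math


def pearson_chi2(table):
    """Pearson chi-squared statistic for a table of counts (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            if exp > 0:               # skip cells in empty rows or columns
                chi2 += (obs - exp) ** 2 / exp
    return chi2


def genotype_test(cases, controls):
    """2x3 genotype test; p-value from chi-squared with 2 df."""
    chi2 = pearson_chi2([cases, controls])
    return chi2, math.exp(-chi2 / 2)


def allele_test(cases, controls):
    """2x2 allele-count test; p-value from chi-squared with 1 df."""
    def alleles(g):
        # Each individual contributes two alleles.
        return [2 * g[0] + g[1], g[1] + 2 * g[2]]
    chi2 = pearson_chi2([alleles(cases), alleles(controls)])
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```

The allele test doubles the apparent sample size by treating the two alleles of one person as independent, which is where its sensitivity to deviation from HWE comes from.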
Tests of association: single SNP. Case-control. Cochran-Armitage test: loses power if additivity assumption wrong. For complex traits additivity often thought to be a good model.
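For reference, the trend statistic can be computed directly from the 2x3 genotype table. The sketch below uses the usual additive scores 0, 1, 2 and the standard closed form of the statistic, compared against a chi-squared distribution with 1 degree of freedom. The function name and the default scores are assumptions for the example, not something specified on the slide.

```python
import math


def cochran_armitage(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test on genotype counts.

    cases, controls -- counts for genotypes with 0, 1, 2 minor alleles
    scores          -- genotype scores; (0, 1, 2) corresponds to additivity
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    totals = [a + u for a, u in zip(cases, controls)]
    n = sum(totals)                    # all subjects
    r = sum(cases)                     # all cases
    sx_cases = sum(x * a for x, a in zip(scores, cases))
    sx = sum(x * t for x, t in zip(scores, totals))
    sxx = sum(x * x * t for x, t in zip(scores, totals))
    chi2 = (n * (n * sx_cases - r * sx) ** 2
            / (r * (n - r) * (n * sxx - sx ** 2)))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Changing the scores to (0, 1, 1) or (0, 0, 1) would give a test tuned to a dominant or recessive model instead.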
Tests of association: single SNP. Case-control. Armitage or goodness-of-fit? Depends on: prior knowledge of inheritance (additive, dominant, etc); genotype frequencies, e.g. use Armitage test when minor allele is rare, goodness-of-fit test otherwise. For complex traits additivity often thought to be a good model.
Tests of association: single SNP. Case-control. Logistic regression: easily incorporates inheritance model (additive, dominant, etc). But assumes phenotype is outcome variable, not genotype, so easier to justify for prospective studies. For complex traits additivity often thought to be a good model.
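To make "incorporates inheritance model" concrete: the model only changes how the genotype is coded as a covariate. The sketch below fits logit P(case) = b0 + b1·x by Newton-Raphson from grouped genotype counts, using nothing beyond the standard library. The coding table, the function name fit_logistic and the fixed iteration count are assumptions for the example; in practice a statistics package would be used.

```python
import math

# Covariate value for genotypes with 0, 1, 2 copies of the minor allele.
CODINGS = {
    "additive": (0.0, 1.0, 2.0),
    "dominant": (0.0, 1.0, 1.0),
    "recessive": (0.0, 0.0, 1.0),
}


def fit_logistic(cases, controls, model="additive", n_iter=25):
    """Fit logit P(case) = b0 + b1 * x from genotype counts.

    cases, controls -- counts for genotypes with 0, 1, 2 minor alleles
    Returns (b0, b1); exp(b1) is the odds ratio per unit of x.
    """
    xs = CODINGS[model]
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, a, u in zip(xs, cases, controls):
            n = a + u
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = a - n * p               # observed minus expected cases
            w = n * p * (1.0 - p)           # binomial variance weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

With the dominant coding the fitted odds ratio is just the 2x2 odds ratio for carriers against non-carriers, which gives a quick sanity check on the fit.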
Tests of association: single SNP. Continuous outcome: linear regression, but the trait must be approximately normal with equal variance across genotypes. Ordered categorical outcomes: multinomial regression.
Problems: population stratification
Correcting for population stratification. Genomic control: genotype null SNPs and use them to calculate background inflation in test statistic due to population stratification; limited to simple single-SNP analyses; can over- or under-correct. Other approaches using null SNPs: regression, principal components analysis, model underlying demography.
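A rough sketch of the genomic control calculation, assuming the test statistics are 1 degree of freedom chi-squared values (for example from the Armitage test). The inflation factor is estimated as the median statistic at the null SNPs divided by the median of the chi-squared distribution with 1 df (about 0.455). Flooring the factor at 1 is a common convention and an assumption here; the function names are illustrative.

```python
import statistics

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(null_stats):
    """Estimate the inflation factor lambda from null-SNP test statistics."""
    return statistics.median(null_stats) / CHI2_1DF_MEDIAN


def genomic_control(test_stats, null_stats):
    """Divide each test statistic by lambda (never by less than 1)."""
    lam = max(1.0, inflation_factor(null_stats))
    return [s / lam for s in test_stats]
```

Because a single factor is applied to every SNP, the correction is too strong for some SNPs and too weak for others, which is the over- or under-correction noted on the slide.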
Problems: multiple testing. Bonferroni correction: conservative when SNPs are linked. Permutation: computationally demanding. False discovery rate. Bayesian approaches.
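Two of these corrections are simple enough to sketch. Bonferroni compares every p-value with alpha divided by the number of tests. For the false discovery rate the slide does not name a procedure, so the Benjamini-Hochberg step-up procedure is used here as an assumption, being the most widely used one. Function names and the default alpha are illustrative.

```python
def bonferroni(pvals, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg FDR procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose p-value is below alpha * k / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

Both functions return one flag per input p-value, in the original SNP order.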
Tests of association: multiple SNPs. Advantages: many SNPs may be linked to a gene, but individually may not have a significant effect; interactions between SNPs can be modelled; ‘tag’ SNPs can reduce testing of redundant linked SNPs. Methods: linear regression, logistic regression; Armitage test; haplotype-based methods (natural interpretation, but power reduced due to multiple alleles).
Haplotypes Nature Genetics 37, 915 - 916 (2005)
Crucially, any stretch of recombining DNA can be divided into regions of high LD (haplotype blocks), and the history of each block can be represented as a tree. Using tag SNPs means genotyping 5-10 times fewer loci.
Inferring haplotype phase
Inferring haplotype phase. Methods & software: PHASE, FASTPHASE, EH+, FBAT, HAPLOTYPER, EM-DECODER, PLEM, HAP, HAPLORE, Haplo.stat, SNPEM, PEDPHASE, SNPHAP, TDTHAP.
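To give a feel for what phase inference involves, here is a toy expectation-maximisation (EM) sketch for two SNPs. It is a generic textbook scheme under the assumption of random mating, not the algorithm of any particular program listed above, and all names in it are made up for the example. With two SNPs only the double heterozygote is ambiguous: it can carry haplotypes 00 and 11, or 01 and 10.

```python
from itertools import product

# The four possible two-SNP haplotypes; 1 marks the minor allele.
HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def em_haplotype_freqs(geno_counts, n_iter=100):
    """Estimate haplotype frequencies from unphased two-SNP genotypes.

    geno_counts -- dict mapping (g1, g2) to a count, where g1 and g2 are
                   the numbers of minor alleles (0, 1 or 2) at each SNP
    Returns a dict mapping each haplotype to its estimated frequency.
    """
    freqs = {h: 0.25 for h in HAPS}           # start from equal frequencies
    total_haps = 2 * sum(geno_counts.values())
    for _ in range(n_iter):
        expected = {h: 0.0 for h in HAPS}
        for (g1, g2), n in geno_counts.items():
            # All ordered haplotype pairs consistent with this genotype.
            pairs = [(h1, h2) for h1, h2 in product(HAPS, repeat=2)
                     if h1[0] + h2[0] == g1 and h1[1] + h2[1] == g2]
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            norm = sum(weights)
            if norm == 0:
                continue                      # genotype has no support
            # E-step: share the n individuals among the consistent pairs.
            for (h1, h2), w in zip(pairs, weights):
                expected[h1] += n * w / norm
                expected[h2] += n * w / norm
        # M-step: new frequencies are the expected haplotype proportions.
        freqs = {h: expected[h] / total_haps for h in HAPS}
    return freqs
```

Real datasets have many SNPs, so the number of possible haplotypes grows quickly and dedicated software is needed.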
Inferring haplotype phase. Phase cases and controls separately or pooled? Separating can give inflated type I error; pooling can reduce power.