2 Introduction
Imputation describes the process of predicting genotypes that have not been directly typed in a sample of individuals:
missing genotypes at typed variants;
genotypes at un-typed variants that are present in an external high-density “reference panel” of phased haplotypes.
In silico genotypes can be tested for association within a standard generalised linear regression framework.
4 What is the purpose of imputation?
Increased power. The reference panel is more likely to contain the causal variant (or a better tag) than a GWAS array.
Fine-mapping. Imputation provides a high-resolution overview of an association signal across a locus.
Meta-analysis. Imputation allows GWAS typed with different arrays to be combined up to variants in the reference panel.
5 Increased power and improved fine-mapping resolution
6 IMPUTEv2 and minimac
Pre-phasing. Estimate haplotypes at variants typed in the study sample (scaffold).
Haploid imputation. Study sample haplotypes are considered an unknown path through haplotypes from the reference panel.
Hidden Markov model (HMM).
Switch probability between reference haplotypes depends on recombination rate.
Allelic mismatch between reference and observed haplotypes can be incorporated by allowing for a low rate of mutation.
Less computationally demanding than diploid imputation, which attempts to phase and impute simultaneously (IMPUTEv1 and MaCH).
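The haploid imputation step can be illustrated with a small sketch. It assumes a Li-Stephens-style HMM with one constant switch probability between adjacent sites (real tools derive this from a recombination map) and a single mismatch rate. The function name `impute_haploid` and the parameter values are hypothetical, and this is not the IMPUTEv2 or minimac implementation.

```python
import numpy as np

def impute_haploid(ref, obs, switch=0.01, mismatch=0.001):
    """Posterior probability of allele 1 at every site of one study haplotype.

    ref: (K, M) array of 0/1 alleles for K reference haplotypes at M sites.
    obs: length-M array with 0/1 at typed sites and -1 at un-typed sites.
    switch: assumed probability of switching reference haplotype between
            adjacent sites (stand-in for a recombination-based value).
    mismatch: assumed probability that the observed allele differs from
              the allele on the copied reference haplotype.
    """
    ref = np.asarray(ref)
    obs = np.asarray(obs)
    K, M = ref.shape

    # Emission probabilities: 1 at un-typed sites (no information there).
    emit = np.ones((K, M))
    typed = obs >= 0
    match = ref[:, typed] == obs[typed]
    emit[:, typed] = np.where(match, 1.0 - mismatch, mismatch)

    # Forward pass, normalised at each site to avoid underflow.
    fwd = np.zeros((K, M))
    fwd[:, 0] = emit[:, 0] / K
    fwd[:, 0] /= fwd[:, 0].sum()
    for m in range(1, M):
        stay = (1.0 - switch) * fwd[:, m - 1]
        jump = switch * fwd[:, m - 1].sum() / K
        fwd[:, m] = emit[:, m] * (stay + jump)
        fwd[:, m] /= fwd[:, m].sum()

    # Backward pass with the same transition structure.
    bwd = np.ones((K, M))
    for m in range(M - 2, -1, -1):
        w = emit[:, m + 1] * bwd[:, m + 1]
        bwd[:, m] = (1.0 - switch) * w + switch * w.sum() / K
        bwd[:, m] /= bwd[:, m].sum()

    # Posterior over which reference haplotype is copied at each site.
    post = fwd * bwd
    post /= post.sum(axis=0, keepdims=True)

    # Probability of allele 1 = posterior weight on haplotypes carrying it.
    return (post * ref).sum(axis=0)

# Toy example: three reference haplotypes, study haplotype typed at sites 0 and 2.
ref_panel = np.array([[0, 0, 0, 0],
                      [1, 1, 1, 1],
                      [0, 1, 0, 1]])
study = np.array([1, -1, 1, -1])
allele1_prob = impute_haploid(ref_panel, study)
```

In the toy example only the second reference haplotype matches both typed alleles, so the posterior concentrates on it and the un-typed sites receive a high probability of allele 1.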
7 Reference panels
Large-scale genotyping and re-sequencing reference panels have been made available through the HapMap Consortium and the 1000 Genomes Project.
HapMap Phase II. 60 CEU, 60 YRI and 90 CHB/JPT individuals typed for ~3M variants.
HapMap Phase 3. Individuals from multiple ethnic groups typed for ~1.6M variants.
1000 Genomes. Most recent release includes 1094 individuals from multiple ethnic groups typed for ~30M variants (including indels).
8 Choice of reference panel
Imputation software is designed for use with 1000 Genomes reference panels, but imputation remains computationally demanding.
Making use of the “all ancestries” reference panel (rather than an ethnic-specific reference panel) improves imputation accuracy for rare variants.
Formatted reference panels for IMPUTEv2 and minimac can be downloaded from the software websites.
9 Factors affecting imputation accuracy
Scaffold. Number of individuals and GWAS array used for genotyping (coverage of variation).
Reference panel. Number of individuals and density of typing. Similarity of ancestry with study sample.
Minor allele frequency.
Pre-phasing or diploid imputation (minimal).
11 Imputation quality control
Pre-imputation. It is essential that the GWAS scaffold excludes poor quality variants. It is common to exclude variants with MAF<1%.
Post-imputation. Imputation quality is assessed by “information measures” in the range 0-1.
A variant with information measure α in a scaffold of N individuals provides power equivalent to αN perfectly genotyped individuals.
It is typical to filter SNPs by α, with commonly used exclusion thresholds of α<0.8 or α<0.4.
IMPUTEv2 “info score” and minimac r̂².
In loci identified through imputation, it is important to check the quality of typed SNPs in the scaffold in the region by visual inspection of cluster plots.
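The information measures above can be sketched from genotype probabilities. The formulas below follow the forms commonly described for the IMPUTE info score and the MaCH/minimac r̂² (ratio of observed to expected dosage variance). They are stated here as assumptions, so output from the software itself may differ in detail. All function names are hypothetical.

```python
import numpy as np

def info_score(probs):
    """IMPUTE-style info for one variant.

    probs: (N, 3) array of P(0 copies), P(1 copy), P(2 copies) per individual.
    """
    probs = np.asarray(probs, dtype=float)
    n = probs.shape[0]
    e = probs[:, 1] + 2.0 * probs[:, 2]   # expected allele count
    f = probs[:, 1] + 4.0 * probs[:, 2]   # expected squared allele count
    theta = e.sum() / (2.0 * n)           # estimated allele frequency
    if theta <= 0.0 or theta >= 1.0:
        return 1.0                        # monomorphic: nothing to lose
    return 1.0 - (f - e ** 2).sum() / (2.0 * n * theta * (1.0 - theta))

def dosage_r2(probs):
    """MaCH/minimac-style r-hat squared: Var(dosage) / (2p(1-p))."""
    probs = np.asarray(probs, dtype=float)
    dosage = probs[:, 1] + 2.0 * probs[:, 2]
    p = dosage.mean() / 2.0
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return dosage.var() / (2.0 * p * (1.0 - p))

def effective_sample_size(alpha, n):
    """Number of perfectly genotyped individuals with equivalent power."""
    return alpha * n

def passes_filter(alpha, threshold=0.4):
    """Keep a variant only if its information measure reaches the threshold."""
    return alpha >= threshold

# Perfectly known genotypes versus completely uninformative imputation.
certain = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
uninformative = np.tile([0.49, 0.42, 0.09], (100, 1))
info_certain = info_score(certain)
info_uninformative = info_score(uninformative)
```

Under these formulas a variant whose genotypes are known with certainty scores 1, and a variant for which every individual receives the same prior genotype distribution scores 0.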
12 Analysis of imputed genotypes
For each individual, imputation provides a probability distribution of possible genotypes at each un-typed variant from the reference panel.
Using the best guess genotype, or filtering on the probability of the best guess genotype, can increase false positives and reduce power.
Convert probabilities to an “expected allele count”, i.e. p1 + 2p2, where p1 and p2 are the probabilities of carrying one and two copies of the allele.
Alternatively, fully take account of the uncertainty in the imputation in a “missing data likelihood”.
Software: SNPTEST2 (for IMPUTEv2) and Mach2Dat (for minimac).
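The difference between the expected allele count and a best guess genotype can be seen in a short sketch. The function name `summarise_genotypes` and the 0.9 calling threshold are illustrative assumptions, not settings taken from SNPTEST2 or Mach2Dat.

```python
import numpy as np

def summarise_genotypes(probs, call_threshold=0.9):
    """Three summaries of imputed genotype probabilities for one variant.

    probs: (N, 3) array of P(0 copies), P(1 copy), P(2 copies) per individual.
    Returns (dosage, best_guess, filtered_best_guess).
    """
    probs = np.asarray(probs, dtype=float)
    # Expected allele count: p1 + 2 * p2.
    dosage = probs[:, 1] + 2.0 * probs[:, 2]
    # Best guess: the single most probable genotype.
    best_guess = probs.argmax(axis=1).astype(float)
    # Filtered best guess: set to missing when the call is not confident.
    confident = probs.max(axis=1) >= call_threshold
    filtered = np.where(confident, best_guess, np.nan)
    return dosage, best_guess, filtered

probs = np.array([[0.50, 0.40, 0.10],    # uncertain individual
                  [0.02, 0.95, 0.03]])   # confidently heterozygous
dosage, best_guess, filtered = summarise_genotypes(probs)
```

For the first individual the best guess is 0 copies even though the expected allele count is 0.6, and thresholding discards that individual entirely. This is the information loss that the expected allele count avoids.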
13 Rare variants and complex disease
Rare variants are likely to have arisen from founder effects in the last few generations.
Rare variants are expected to have larger effects on complex traits than common variants.
Statistical methods focus on the accumulation of minor alleles at rare variants (mutational load) within the same functional unit.
14 GRANVIL
Test of association of phenotype with the proportion of rare variants at which individuals carry minor alleles.
Model disease phenotype via regression on p_i and any other covariates in a GLM framework.
Example: p_i = 3/10.
Reedik Magi
15 Assaying rare genetic variation
Gold-standard approach to assaying rare genetic variation is through re-sequencing, which is expensive on the scale of the whole genome.
GWAS genotyping arrays are inexpensive, but are not designed to capture rare genetic variation.
Increasing availability of large-scale reference panels of whole-genome re-sequencing data: 1000 Genomes Project and the UK10K Project.
Impute into GWAS scaffolds up to these reference panels to recover genotypes at rare variants at no additional cost, other than computing.
16 GRANVIL: imputed variants
Test of association of phenotype with the proportion of rare variants at which individuals carry minor alleles.
Replace direct genotypes with the posterior probability of a heterozygous or rare homozygous call from imputation.
Model disease phenotype via regression on p_i and any other covariates in a GLM framework.
Example: p_i = 3.0/10.
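The burden statistic p_i can be sketched for both directly typed and imputed variants, followed by a regression on p_i. This is an illustration under stated assumptions, not the GRANVIL code: the function names are hypothetical, missing genotypes are assumed to be coded as NaN, and the logistic model is fitted with a plain Newton-Raphson loop for a binary phenotype.

```python
import numpy as np

def carrier_probability(probs):
    """P(heterozygous) + P(rare homozygous) from imputed probabilities.

    probs: (N, M, 3) array for N individuals at M rare variants.
    """
    probs = np.asarray(probs, dtype=float)
    return probs[..., 1] + probs[..., 2]

def rare_allele_proportion(carrier):
    """p_i: proportion of rare variants at which individual i carries a minor allele.

    carrier: (N, M) array; 0/1 indicators for direct genotypes, or carrier
    probabilities for imputed variants. NaN marks a missing genotype.
    """
    carrier = np.asarray(carrier, dtype=float)
    n_called = (~np.isnan(carrier)).sum(axis=1)
    r = np.nansum(carrier, axis=1)
    return np.where(n_called > 0, r / np.maximum(n_called, 1), np.nan)

def fit_logistic(x, y, n_iter=25):
    """Logistic regression of binary y on x (intercept added), by Newton-Raphson."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        weights = mu * (1.0 - mu)
        gradient = X.T @ (y - mu)
        hessian = X.T @ (X * weights[:, None])
        beta += np.linalg.solve(hessian, gradient)
    return beta  # [intercept, effect of x, ...]

# Direct genotypes: minor alleles carried at 3 of 10 rare variants.
direct = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
p_direct = rare_allele_proportion(direct)
```

Further covariates, such as principal components, could be passed as extra columns of `x`. For a quantitative trait the same p_i would enter a linear model instead.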
17 Application to WTCCC
GWAS of seven complex human diseases from the UK (2000 cases each and 3000 shared controls from the 1958 British Birth Cohort and National Blood Service):
bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D) and type 2 diabetes (T2D).
Individuals genotyped using the Affymetrix GeneChip 500K Mapping Array Set.
After quality control, 16,179 samples and 391,060 autosomal SNPs (MAF>1%) carried forward for analysis.
18 Fine-scale UK population structure
Fine-scale population structure may have greater impact on rare variants than on common SNPs because of recent founder effects.
Utilised EIGENSTRAT to construct principal components to represent axes of genetic variation across the UK, from 27,770 high-quality LD-pruned (r²<0.2) common autosomal SNPs (MAF>5%).
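Constructing principal components from pruned common SNPs can be sketched with NumPy. This is in the spirit of the EIGENSTRAT normalisation but is not that program: the function name is hypothetical, the genotype matrix is assumed to have no missing values, and allele frequencies are estimated as simple sample means.

```python
import numpy as np

def principal_components(genotypes, n_components=3):
    """Principal component scores from an (N, M) matrix of 0/1/2 allele counts."""
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                  # allele frequency per SNP
    keep = (p > 0.0) & (p < 1.0)              # drop monomorphic SNPs
    # Centre each SNP and scale by sqrt(p(1-p)).
    Z = (G[:, keep] - 2.0 * p[keep]) / np.sqrt(p[keep] * (1.0 - p[keep]))
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Toy data: two pairs of individuals with opposite genotypes at four SNPs.
toy = np.array([[2, 2, 0, 0],
                [2, 2, 0, 0],
                [0, 0, 2, 2],
                [0, 0, 2, 2]])
pcs = principal_components(toy)
```

The columns of `pcs` could then be included as covariates in the association model. In the toy data the first component separates the two pairs of individuals.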
20 Imputation
SNPs mapped to NCBI build 37 of the human genome.
Samples imputed up to the 1000 Genomes Phase 1 cosmopolitan reference panel (June 2011 interim release).
8.23M imputed autosomal rare variants (MAF<1%) polymorphic in WTCCC.
5.38M (65.3%) were “well-imputed” (i.e. info score > 0.4) and carried forward for analysis.
Mean info score was 0.618, and 17.3% had info score > 0.8.
21 Rare variant analysis
Test for association of each disease with accumulation of rare variants (MAF<1%) within genes using GRANVIL.
Gene boundaries defined from the UCSC human genome database (build 37).
Analyses included three principal components as covariates to adjust for fine-scale UK population structure.
Genome-wide significance threshold p<1.7×10⁻⁶: Bonferroni adjustment for 30,000 genes.
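The genome-wide significance threshold can be reproduced with a one-line calculation. The family-wise error rate of 0.05 is an assumption: it is not stated on the slide, but it is the value that gives 1.7×10⁻⁶ for 30,000 genes.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / n_tests

# Assumed family-wise error rate of 0.05 across 30,000 gene-based tests.
threshold = bonferroni_threshold(0.05, 30_000)
formatted = f"{threshold:.1e}"
```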
23 Rare variant association with T1D
Genome-wide significant evidence of association of T1D with rare variants in multiple genes from the MHC.
Strongest signal of association observed for HLA-DRA (p=2.0×10⁻¹³).
Gene contains 23 well-imputed rare variants with mean MAF of 0.32%.
Accumulations of minor alleles across these variants were associated with decreased risk of disease: odds ratio ( ) per minor allele.
24 T1D association across the MHC
Ten genes achieve genome-wide significant evidence of rare variant association with T1D.
HLA-DRA, SLC44A4, HLA-DRB5, PBX2, TNXA, PBMUCL2, EHMT2, AGPAT1, C6orf10, NCR3
25 T1D association across the MHC
After additional adjustment for additive effect of lead GWAS common variant from the MHC (rs ).
PBX2, HLA-DRA, HLA-DRB5, SLC44A4, SKIVL2, HLA-DMA, PBMUCL2, EHMT2, AGPAT1, TNXB
27 Comments
GRANVIL assumes the same direction of effect on the trait for all rare variants within the functional unit.
Methods allowing for different directions of effect of rare variants are well established for re-sequencing data, and are being generalised to allow for imputation.
The most powerful rare variant test will depend on the underlying genetic architecture of the trait.