Genotype Imputation for African Americans and Hispanics in WHI using reference haplotypes from the 1000 Genomes Project
Presented by Qing Duan
Dr. Yun Li group, UNC at Chapel Hill
09-13-2012
Outline
Imputation
– Study samples: WHI African American and Hispanic samples
– Reference haplotypes: 1000 Genomes Project (version 3, March 2012 release)
  Number of markers in reference haplotypes: ~38M
Post-imputation quality assessment
– Evaluation of imputation quality by comparison with actual genotypes from Metabochip genotyping
– Estimation of total number of QC+ markers and number of QC+ indels
QC on WHI Genotypes
QC was performed separately within the African American and Hispanic samples, for autosomes and for chromosome X. We excluded markers with:
– Deviation from Hardy-Weinberg equilibrium (HWE p-value < 1e-6)
– Low genotype completeness (< 90%)
– Low minor allele frequency
  Chromosomes 1-22: MAF < 1%
  Chromosome X: singleton or monomorphic markers
With thanks to Eric Yi Liu
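The exclusion rules on this slide can be sketched as a per-marker filter. This is a minimal illustration, not the group's actual pipeline: the `MarkerStats` record, its field names, and the function name are hypothetical, the summary statistics are assumed to be computed beforehand, and applying the HWE and completeness filters to chromosome X in the same way as to autosomes is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MarkerStats:
    """Hypothetical per-marker summary, computed within one cohort."""
    chrom: str               # "1".."22" or "X"
    hwe_p: float             # Hardy-Weinberg equilibrium test p-value
    call_rate: float         # genotype completeness, as a fraction
    maf: float               # minor allele frequency, as a fraction
    minor_allele_count: int  # copies of the minor allele observed


HWE_P_MIN = 1e-6         # exclude if HWE p-value is below this
CALL_RATE_MIN = 0.90     # exclude if completeness is below 90%
AUTOSOME_MAF_MIN = 0.01  # exclude autosomal markers with MAF below 1%


def passes_gwas_qc(m: MarkerStats) -> bool:
    """Return True if the marker survives the exclusion rules above."""
    if m.hwe_p < HWE_P_MIN:
        return False
    if m.call_rate < CALL_RATE_MIN:
        return False
    if m.chrom == "X":
        # Chromosome X: drop only monomorphic (count 0) and singleton (count 1) markers.
        return m.minor_allele_count >= 2
    return m.maf >= AUTOSOME_MAF_MIN
```

Under these assumptions, a rare chromosome X marker with two observed minor alleles is kept, while an autosomal marker with the same frequency is dropped.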
Summary of samples and GWAS QC+ markers
Number of individuals
– WHI_AA: 8,421
– WHI_HA: 3,587
Number of markers
          Chr1-22               ChrX
          WHI_AA     WHI_HA     WHI_AA    WHI_HA
Total     860,510    860,510    36,889    36,889
QC+       829,370    834,826    35,411    35,035
Note: chromosome X imputation is still in progress; results for chromosome X will be available soon.
Reference Haplotypes
The complete set of 1000G Phase I Integrated Release version 3 haplotypes in VCF format (March 2012 release)
– A total of 2,184 haplotypes
– A total of ~38M markers, including singleton and monomorphic sites
– About 1.4M markers are short indels or large deletions; the rest are SNPs.
Note on reference haplotypes
A more recent reduced set of reference haplotypes, with singleton and monomorphic markers removed, is also available.
– Number of markers: ~30M
– Every marker in the reduced set is included in the complete set of reference haplotypes.
– We expect singleton and monomorphic markers to have little influence on imputation quality, because:
  Phasing of the reference haplotypes was performed with the singleton and monomorphic markers included
  Our previous evaluation shows little effect of singletons on the quality of imputation (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).
Two-step genotype imputation -- Procedure
Step 1: Pre-phasing (MaCH1)
– WHI African American and Hispanic samples were phased separately.
Step 2: Genotype imputation (minimac)
– WHI African American and Hispanic samples were imputed separately.
– Haplotype-to-haplotype imputation: the study haplotypes pre-phased in step 1 are imputed using the complete set of reference haplotypes from the 1000 Genomes Project.
Two-step genotype imputation -- Computational costs
Phasing and imputation strategy
– Split chromosomes into segments
– Phase / impute each segment
– Ligate segments back into chromosomes
Computational costs
                                        WHI_AA                   WHI_HA
Phasing
  Split strategy (sample genotypes)
    Core region                         3000 markers             3000 markers
    Flanking                            500 markers each         500 markers each
  # segments after splitting            277                      278
  Median run time                       ~245 hours (~10 days)    ~63 hours (~3 days)
Imputation
  Split strategy (reference haplotypes)
    Core region                         5 Mb                     20 Mb
    Flanking                            500 Kb each              500 Kb each
  # segments after splitting            520                      150
  Median run time                       ~41 hours (~2 days)      ~71 hours (~3 days)
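The split strategy can be illustrated with a short sketch that cuts a chromosome into core regions padded by flanking regions on each side. The function name and the handling of the final short segment are assumptions for illustration; the slides do not say how the last segment of a chromosome was treated, and this is not the tooling actually used.

```python
def split_segments(n_units, core=3000, flank=500):
    """Split positions 0..n_units-1 into overlapping windows.

    Returns a list of (window_start, window_end, core_start, core_end),
    all half-open. Each window is phased or imputed as a whole; only the
    core part would be kept when ligating, so the cores tile the
    chromosome exactly once. Units can be marker indices (phasing) or
    base pairs (imputation).
    """
    segments = []
    for core_start in range(0, n_units, core):
        core_end = min(core_start + core, n_units)
        window_start = max(0, core_start - flank)      # clip at chromosome start
        window_end = min(n_units, core_end + flank)    # clip at chromosome end
        segments.append((window_start, window_end, core_start, core_end))
    return segments


# Example: a hypothetical chromosome with 7000 genotyped markers.
for seg in split_segments(7000):
    print(seg)
```

The flanks give each segment haplotype context beyond its core boundaries, which is why results in the flanks can be discarded at ligation without leaving gaps.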
Summary of imputation results -- Before QC
                                        WHI_AA        WHI_HA
Number of individuals                   8,421         3,587
Total number of imputed markers         38,050,692    38,050,692
Number of imputed indels                1,380,758     1,380,758
File size (all files gz compressed)     170 G         71 G
Note: Markers whose quality filter is missing in the 1000G reference haplotypes are excluded from imputation. We found that all excluded markers are of type “MERGED_DEL”.
Evaluation of imputation quality -- Introduction
Main idea
– Compare imputed dosages with actual genotypes
Quality metrics
– Dosage r2: squared correlation coefficient between imputed dosages (continuous values ranging between 0 and 2) and actual genotypes (coded as 0, 1 and 2)
  True imputation accuracy (range 0 ~ 1)
– Rsq: estimated dosage r2
  Estimated imputation accuracy
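For one marker, dosage r2 as defined above is the squared Pearson correlation between the imputed dosages and the actual genotypes across individuals. A minimal sketch follows; the function name is hypothetical, and returning NaN when either vector has no variation is a choice made here for illustration.

```python
import math


def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages (0..2)
    and actual genotypes (0, 1, 2) for a single marker."""
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 individuals")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return math.nan  # correlation is undefined without variation
    return cov * cov / (var_d * var_g)


# Example with made-up values for five individuals.
print(dosage_r2([1.9, 1.1, 0.2, 1.0, 1.8], [2, 1, 0, 1, 2]))
```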
Evaluation of imputation quality -- Study design
[Figure: dosage r2 is calculated between the matrix of imputed dosages and the matrix of actual genotypes (Metabochip)]
Individuals used in evaluation
– 1,962 WHI African American samples
Markers used in evaluation
– Markers that overlap between 1000G and Metabochip but are not on Affymetrix 6.0 (all 22 autosomes)
– Minor allele frequency (MAF) is defined within the 1,962 individuals
Estimation of imputation quality -- Summary
We recommend Rsq QC thresholds of 0.7, 0.6 and 0.3 for the MAF 0.1~0.5%, 0.5~1%, and >1% categories, respectively
– The thresholds are chosen such that an average Rsq greater than 0.8 is achieved in each MAF category (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).
Estimation based on imputation quality assessment
– Total number of markers passing QC
– Total number of indels passing QC
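The MAF-specific thresholds can be applied with a small lookup, sketched below. The function names are hypothetical. Three details are assumptions not stated on the slide: which category a marker exactly on a bin boundary falls into, that a marker passes when its Rsq is at or above the threshold, and that markers with MAF below 0.1% (for which no threshold is given) are treated as failing.

```python
def rsq_threshold(maf):
    """Return the recommended Rsq cutoff for a MAF given as a fraction,
    or None when the MAF is below the lowest category (0.1%)."""
    if maf > 0.01:     # MAF > 1%
        return 0.3
    if maf > 0.005:    # MAF 0.5~1% (boundary handling is an assumption)
        return 0.6
    if maf >= 0.001:   # MAF 0.1~0.5%
        return 0.7
    return None        # no threshold given in the slides for MAF < 0.1%


def passes_imputation_qc(maf, rsq):
    """True if the marker's Rsq meets the cutoff for its MAF category."""
    cutoff = rsq_threshold(maf)
    return cutoff is not None and rsq >= cutoff


# Example: the same Rsq of 0.65 passes for a 0.7% marker but not a 0.2% marker.
print(passes_imputation_qc(0.007, 0.65), passes_imputation_qc(0.002, 0.65))
```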
Estimation based on imputation quality assessment -- Note
The values are estimated because:
– Estimated Rsq cutoffs
  Evaluation is based on markers on Metabochip
– Estimated MAF
  MAF of imputed markers is calculated based on imputed dosages
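One way to calculate MAF from imputed dosages is sketched here, assuming each dosage is the expected number of copies (0 to 2) of one fixed allele per individual. The function name is hypothetical and the exact calculation used for these results is not given in the slides.

```python
def maf_from_dosages(dosages):
    """Estimate minor allele frequency from per-individual dosages.

    The mean dosage divided by 2 estimates the frequency of the counted
    allele; the MAF is the smaller of that frequency and its complement.
    """
    if not dosages:
        raise ValueError("need at least one individual")
    allele_freq = sum(dosages) / (2 * len(dosages))
    return min(allele_freq, 1 - allele_freq)


# Example with made-up dosages for four individuals.
print(maf_from_dosages([0.0, 0.0, 1.0, 2.0]))
```

Because the dosages are themselves estimates, the resulting MAF, and therefore the MAF category a marker is assigned to, is also an estimate.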
Estimation based on imputation quality assessment -- Note (cont’d)
The values are estimated because:
– Estimated QC thresholds for WHI Hispanic samples
  We assumed that WHI Hispanics have Rsq cutoffs in each MAF category similar to those of WHI African Americans
  We will do a similar quality assessment in the Hispanic samples once we have their QC+ Metabochip data
– Estimated QC thresholds for indels
  Rsq cutoffs are set based on evaluation of SNPs. We assumed that indels have Rsq cutoffs in each MAF category similar to those of SNPs
Estimation based on imputation quality assessment -- Total number of markers passing QC
Note: Markers include both SNPs and indels
Estimation based on imputation quality assessment -- Number of indels passing QC
Summary
We conducted genotype imputation for 8,421 African American and 3,587 Hispanic samples in the Women’s Health Initiative (WHI) study using reference haplotypes from the 1000 Genomes Project (version 3, March 2012 release).
Summary of imputation results before and after QC
                                        WHI_AA                      WHI_HA
                                        Before QC     After QC      Before QC     After QC
Number of individuals                   8,421         8,421         3,587         3,587
Total number of markers                 38,050,692    18,940,103    38,050,692    15,214,231
Number of indels                        1,380,758     1,219,538     1,380,758     1,126,704
File size (all files gz compressed)     170 G         102 G         71 G          33 G