Imputation-based local ancestry inference in admixed populations

Presentation transcript:

Imputation-based local ancestry inference in admixed populations
Justin Kennedy
Computer Science and Engineering Department, University of Connecticut
Joint work with I. Mandoiu and B. Pasaniuc

Outline
  Introduction
  Factorial HMM of genotype data
  Algorithms for genotype imputation and ancestry inference
  Preliminary experimental results
  Conclusion

Notes: I'll begin with an introduction that gives an overview, the motivation, and a more formal definition of the ancestry inference problem. I'll then describe our approach to this problem, which includes a factorial HMM of genotype data and algorithms for genotype imputation and ancestry inference. We have recently implemented our approach in a software package, and I will show some preliminary experimental results obtained with it. Finally, I will conclude with a summary of our contribution and list some future work items.

Introduction - Motivation: Admixture mapping
Admixture mapping is a method for localizing disease-causing genetic variants that differ in frequency across populations. It is most advantageous to apply this approach to populations that have descended from a recent mix of two ancestral groups that had been geographically isolated for many tens of thousands of years (e.g., African Americans).
[Patterson et al., AJHG 74:979-1000, 2004]

Introduction - Local ancestry inference problem
Given:
  Reference haplotypes for ancestral populations P1,…,PN
  Whole-genome SNP genotype data for an extant individual
Find:
  Allele ancestries at each SNP locus

(Extant: still in existence; not extinct, destroyed, or lost.)

Figure: reference haplotype panels for the ancestral populations (binary allele strings, with missing alleles shown as "?"), alongside the individual's SNP genotypes and inferred local ancestry:

SNP         Genotype  Inferred local ancestry
rs11095710  T T       P1 P1
rs11117179  C T       P1 P1
rs11800791  G G       P1 P1
rs11578310  G G       P1 P2
rs1187611   G G       P1 P2
rs11804808  C C       P1 P2
rs17471518  A G       P1 P2
...

Introduction - Previous work
Many methods
  Ancestry inference at different granularities, assuming different kinds/amounts of information about the genetic makeup of ancestral populations
Two main classes of methods
  HMM-based (exploit LD): SABER [Tang et al. 06], SWITCH [Sankararaman et al. 08a], HAPAA [Sundquist et al. 08], …
  Window-based (unlinked SNP data): LAMP [Sankararaman et al. 08b], WINPOP [Pasaniuc et al. 09]
Poor accuracy when ancestral populations are closely related (e.g., Japanese and Chinese)
  Methods based on unlinked SNPs outperform methods that model LD!

Notes: The HMM-based methods differ in the exact structure of the model and in the procedures used for estimating model parameters, but all of them exploit LD information. The second class of methods treats SNPs as unlinked (ignoring LD), estimates the ancestry structure within a window-based framework, and aggregates the results for each SNP using a majority vote. Surprisingly, these window-based methods outperform the HMM-based methods.

Outline
  Introduction
  Factorial HMM of genotype data
  Algorithms for genotype imputation and ancestry inference
  Preliminary experimental results
  Conclusion

Notes: Our method employs a factorial HMM of genotype data, which does exploit LD, and we aim to improve on each of the previous methods.

Haplotype structure in panmictic populations
(Panmictic: random mating within a breeding population.)

Notes: To help understand our HMM implementation, consider an extant population whose haplotype gene pool arose from a small set of ancestral haplotypes. Through random mating and recombination, the extant haplotypes include segments that come directly from different ancestors, along with mutations that have occurred over time.

HMM of haplotype frequencies
Figure labels: n (# SNPs), K = 4 (# founders)
Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel & Shamir 05, Scheet & Stephens 06, …]

Notes: This type of recombination can be captured in an HMM of haplotype diversity, which we employ and which is similar to other models proposed in recent work. Specifically, our HMM is defined by n × K states, where n is the number of SNPs and K is the number of founder haplotypes. Transitions go from left to right, occur only between adjacent SNPs, and represent the probability of adherence to, or deviation from, these ancestral founder haplotypes. Emissions at each state represent the probability of observing a major or minor allele.

Graphical model representation
Figure: founder states F1, F2, …, Fn, each emitting the observed allele H1, H2, …, Hn
Random variables for each locus i (i=1..n)
  Fi = founder haplotype at locus i; values between 1 and K
  Hi = observed allele at locus i; values: 0 (major) or 1 (minor)
Model training
  Based on reference haplotypes using the Baum-Welch algorithm, or
  Based on unphased genotypes using EM [Rastas et al. 05]
Given haplotype h, P(H=h|M) can be computed in O(nK^2) time using a forward algorithm, where n = # SNPs and K = # founders

Notes: We can represent this HMM graphically as a founder haplotype and an observed allele for each locus i (shown here as Fi and Hi, respectively). Under this model, each haplotype in the current population can be viewed as a mosaic formed by historical recombination among a set of these founder haplotypes. Model training can be based on reference haplotypes using Baum-Welch, if these are available, or on the unphased genotype data representing the population of interest, using EM. Once the HMM is satisfactorily trained, the probability of observing a haplotype h given model M can be computed with a standard forward algorithm in O(nK^2) time, where again n is the number of SNPs and K is the number of founders. Increasing the number of founders allows more paths to describe a haplotype sequence, but at the cost of a longer run time.
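The forward computation mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the layout of `init`, `trans`, and `emit` are assumptions made for the example, and loci are indexed from 0.

```python
def haplotype_probability(h, init, trans, emit):
    """Forward algorithm: return P(H = h | M) in O(n * K^2) time.

    Hypothetical parameter layout (0-based loci):
      h[i]           -- observed allele at locus i: 0 (major) or 1 (minor)
      init[k]        -- probability that the founder at locus 0 is k
      trans[i][k][j] -- P(founder at locus i+1 is j | founder at locus i is k)
      emit[i][k]     -- P(minor allele at locus i | founder k)
    """
    n, K = len(h), len(init)

    def emission(i, k):
        p_minor = emit[i][k]
        return p_minor if h[i] == 1 else 1.0 - p_minor

    # fwd[k] = P(h[0..i], founder at locus i is k)
    fwd = [init[k] * emission(0, k) for k in range(K)]
    for i in range(1, n):
        fwd = [
            emission(i, j) * sum(fwd[k] * trans[i - 1][k][j] for k in range(K))
            for j in range(K)
        ]
    return sum(fwd)
```

Each locus costs K^2 multiplications (every founder j sums over every previous founder k), which is where the O(nK^2) bound comes from.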

Factorial HMM for genotype data in a window with known local ancestry
Figure: two founder chains, F1, F2, …, Fn emitting H1, H2, …, Hn and F'1, F'2, …, F'n emitting H'1, H'2, …, H'n, which jointly determine the genotypes G1, G2, …, Gn
Random variable for each locus i (i=1..n)
  Gi = genotype at locus i; values: 0/1/2 (major hom./het./minor hom.)

Notes: Given two distinct ancestral groups that have recently admixed to produce the individual or population of interest, the factorial HMM that we use is, at its core, two of the regular HMMs I just described.

Outline
  Introduction
  Factorial HMM of genotype data
  Algorithms for genotype imputation and ancestry inference
  Preliminary experimental results
  Conclusion

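HMM-based genotype imputation
Probability of observing genotype x at locus i given the known multilocus genotype g with missing data at i: P(g_i = x | g_{-i}, M), which is proportional to the probability under M of g with x filled in at locus i
g_i is imputed as the value x in {0, 1, 2} that maximizes P(g_i = x | g_{-i}, M)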
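As a rough sketch of the imputation rule on this slide, the snippet below tries each of the three genotype values at the missing locus, scores the completed genotype with a model likelihood, and normalizes. The `likelihood` argument is a hypothetical stand-in for the factorial HMM probability of a complete multilocus genotype; the slides compute these quantities with forward-backward recurrences rather than three separate likelihood calls.

```python
def impute_genotype(g, i, likelihood):
    """Impute the genotype at locus i from the rest of the multilocus genotype.

    g          -- list of genotypes (0/1/2); the entry at position i is ignored
    i          -- index of the locus to impute
    likelihood -- hypothetical callable returning the model probability of a
                  complete multilocus genotype (assumed positive for some x)

    Returns (best_value, posterior), where posterior[x] is the normalized
    probability of genotype x at locus i.
    """
    scores = []
    for x in (0, 1, 2):
        filled = list(g)
        filled[i] = x  # plug candidate value into the missing position
        scores.append(likelihood(filled))
    total = sum(scores)
    posterior = [s / total for s in scores]
    best = max((0, 1, 2), key=lambda x: posterior[x])
    return best, posterior
```

The returned posterior is what a cutoff threshold would later be applied to when deciding whether to leave a genotype un-imputed.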

Forward-backward computation
Figure: slice of the factorial HMM at locus i, showing founder states fi and f'i, alleles hi and h'i, and genotype gi

Runtime
Direct recurrences for computing forward probabilities: O(nK^4)
Runtime reduced to O(nK^3) by reusing common terms
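The two running times can be illustrated with a small sketch of the forward pass over founder pairs (k, l). The recurrences from the slide are not preserved in this transcript, so the factoring below (summing over one chain first and reusing the partial sums) is an assumption about what "reusing common terms" means. The function names and the `(init, trans, emit)` tuple layout are hypothetical.

```python
def genotype_emission(g, p, q):
    """P(G = g) when the two alleles are minor with probabilities p and q."""
    if g == 0:
        return (1 - p) * (1 - q)
    if g == 2:
        return p * q
    return p * (1 - q) + (1 - p) * q


def _initial(g, hmm1, hmm2):
    (init1, _, emit1), (init2, _, emit2) = hmm1, hmm2
    return [[init1[k] * init2[l] * genotype_emission(g[0], emit1[0][k], emit2[0][l])
             for l in range(len(init2))] for k in range(len(init1))]


def forward_direct(g, hmm1, hmm2):
    """P(G = g) with the direct recurrence: O(K^4) work per locus."""
    (init1, trans1, emit1), (init2, trans2, emit2) = hmm1, hmm2
    K1, K2 = len(init1), len(init2)
    f = _initial(g, hmm1, hmm2)
    for i in range(1, len(g)):
        t1, t2 = trans1[i - 1], trans2[i - 1]
        f = [[genotype_emission(g[i], emit1[i][k], emit2[i][l])
              * sum(f[a][b] * t1[a][k] * t2[b][l]
                    for a in range(K1) for b in range(K2))
              for l in range(K2)] for k in range(K1)]
    return sum(map(sum, f))


def forward_factored(g, hmm1, hmm2):
    """Same probability, reusing partial sums: O(K^3) work per locus."""
    (init1, trans1, emit1), (init2, trans2, emit2) = hmm1, hmm2
    K1, K2 = len(init1), len(init2)
    f = _initial(g, hmm1, hmm2)
    for i in range(1, len(g)):
        t1, t2 = trans1[i - 1], trans2[i - 1]
        # c[a][l]: sum over the second chain's previous founder b (shared by all k)
        c = [[sum(f[a][b] * t2[b][l] for b in range(K2)) for l in range(K2)]
             for a in range(K1)]
        f = [[genotype_emission(g[i], emit1[i][k], emit2[i][l])
              * sum(c[a][l] * t1[a][k] for a in range(K1))
              for l in range(K2)] for k in range(K1)]
    return sum(map(sum, f))
```

Both functions return the same value; the factored version simply avoids recomputing the inner sum over the second chain for every choice of k.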

Imputation-based ancestry inference
View local ancestry inference as a model selection problem
  Each possible local ancestry defines a factorial HMM
  Compute the posterior genotype probabilities for all possible k, l, i, x values
  Pick the model that re-imputes SNPs most accurately around locus i
Fixed-window version: pick the ancestry that maximizes the average posterior probability of the SNP genotypes within a fixed-size window centered at the locus
Multi-window version: weighted voting over window sizes between 200 and 3,000, with window weights proportional to average posterior probabilities
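The fixed-window and multi-window rules can be sketched as below. This is an illustrative reading of the slide, not the released code: `post[a][j]` is assumed to hold the posterior probability that the model for candidate ancestry `a` assigns to the observed genotype at SNP `j` when that SNP is re-imputed, and in the multi-window version each window size is assumed to cast one vote for its best ancestry, weighted by that ancestry's average posterior.

```python
def window_average(values, i, w):
    """Average of values in a window of about w SNPs centered at index i."""
    lo = max(0, i - w // 2)
    hi = min(len(values), i + w // 2 + 1)
    return sum(values[lo:hi]) / (hi - lo)


def fixed_window_ancestry(post, i, w):
    """Pick the ancestry whose model best re-imputes the SNPs around locus i."""
    return max(post, key=lambda a: window_average(post[a], i, w))


def multi_window_ancestry(post, i, window_sizes):
    """Weighted vote over several window sizes (voting scheme is an assumption)."""
    votes = {a: 0.0 for a in post}
    for w in window_sizes:
        best = fixed_window_ancestry(post, i, w)
        votes[best] += window_average(post[best], i, w)
    return max(votes, key=votes.get)


# Toy example with three candidate ancestries over ten SNPs.
post = {
    ("P1", "P1"): [0.9] * 5 + [0.4] * 5,
    ("P1", "P2"): [0.5] * 5 + [0.9] * 5,
    ("P2", "P2"): [0.3] * 10,
}
left_call = fixed_window_ancestry(post, 1, 3)
right_call = fixed_window_ancestry(post, 8, 3)
voted_call = multi_window_ancestry(post, 1, [3, 5])
```

Windows are truncated at the ends of the SNP list, so loci near a boundary are scored on fewer SNPs.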

Imputation-based ancestry inference
Local ancestry at a locus is an unordered pair of (not necessarily distinct) ancestral populations.
Observations:
  The local ancestry of a SNP locus is typically shared with neighboring loci.
    Small window sizes may not provide enough information
    Large window sizes may violate the shared local ancestry property for neighboring loci
  When using the factorial HMM that corresponds to the true local ancestry, the accuracy of SNP genotype imputation within such a neighborhood is typically higher than when using a mis-specified model.

Notes: For individuals from recently admixed populations, the local ancestry of a SNP locus is typically shared with a large number of neighboring loci. The accuracy of SNP genotype imputation within such a neighborhood is typically higher when using the factorial HMM corresponding to the correct local ancestry than when using a mis-specified model.

Outline
  Introduction
  Factorial HMM of genotype data
  Algorithms for genotype imputation and ancestry inference
  Preliminary experimental results
  Conclusion

HMM imputation accuracy
Figure: missing data rate and accuracy for imputed genotypes at different thresholds (WTCCC 58BC/HapMap CEU)

Notes: We measured the error rate as the percentage of erroneously recovered genotypes out of the total number of masked genotypes. Since the model provides the posterior probability for each imputed SNP genotype, one can obtain different tradeoffs between the error rate and the percentage of imputed genotypes by varying the cutoff threshold on the posterior imputation probability. This figure plots the achievable tradeoffs. For example, with a cutoff threshold of 0.95, HMM-based imputation has an error rate of 1.7%, with 24% of the genotypes left un-imputed.
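The threshold tradeoff described in the notes can be sketched as below. The record layout `(posterior, imputed, truth)` is a hypothetical choice for the example, and the error rate follows the wording of the notes (errors divided by the total number of masked genotypes), which is one possible reading.

```python
def imputation_tradeoff(calls, threshold):
    """Return (error_rate, unimputed_rate) for a posterior cutoff.

    calls     -- list of (posterior, imputed_genotype, true_genotype) tuples,
                 one per masked genotype (hypothetical layout)
    threshold -- genotypes with posterior below this value are left un-imputed
    """
    total = len(calls)
    kept = [(imputed, truth) for posterior, imputed, truth in calls
            if posterior >= threshold]
    errors = sum(1 for imputed, truth in kept if imputed != truth)
    error_rate = errors / total          # relative to all masked genotypes
    unimputed_rate = 1 - len(kept) / total
    return error_rate, unimputed_rate


# Toy data: four masked genotypes with their imputation posteriors.
calls = [(0.99, 0, 0), (0.97, 1, 2), (0.90, 1, 1), (0.60, 2, 0)]
strict = imputation_tradeoff(calls, 0.95)
lenient = imputation_tradeoff(calls, 0.0)
```

Sweeping `threshold` over a grid of values and plotting the resulting pairs would give a tradeoff curve of the kind the slide describes.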

Window size effect
Parameters: N=2,000, g=7, α=0.2, n=38,864, r=10^-8

Notes: As previously reported for other window-based methods, we also observe that the best window size for our method on the three datasets is correlated with the genetic distance between the ancestral populations: closer ancestral populations benefit from longer windows for accurate predictions.

Number of founders effect
CEU-JPT
Parameters: N=2,000, g=7, α=0.2, n=38,864, r=10^-8

Comparison with other methods
% of correctly recovered SNP ancestries
Parameters: N=2,000, g=7, α=0.2, n=38,864, r=10^-8

Notes: α is the fraction of individuals coming from one population (α × 100 is the percentage), with 1 − α being the fraction coming from the second population.

Untyped SNP imputation error rate in admixed individuals
Parameters: N=2,000, g=7, α=0.5, n=38,864, r=10^-8

Notes: α is the fraction of individuals coming from one population (α × 100 is the percentage), with 1 − α being the fraction coming from the second population.

Outline
  Introduction
  Factorial HMM of genotype data
  Algorithms for genotype imputation and ancestry inference
  Preliminary experimental results
  Conclusion

Conclusion - Summary and ongoing work
Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations
  Code at http://dna.engr.uconn.edu/software/
Ongoing work
  Evaluating accuracy under more realistic admixture scenarios (multiple ancestral populations/gene flow/drift in ancestral populations)
  Extension to pedigree data
  Exploiting inferred local ancestry for more accurate untyped SNP imputation and phasing of admixed individuals
  Extensions to sequencing data
  Inference of ancestral haplotypes from extant admixed populations

Acknowledgments
Work supported in part by NSF awards IIS-0546457 and DBI-0543365.