Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Justin Kennedy, Ion Mandoiu, Bogdan Pasaniuc CSE Department, University of Connecticut.

Slides:



Advertisements
Similar presentations
Marius Nicolae Computer Science and Engineering Department
Advertisements

Why this paper Causal genetic variants at loci contributing to complex phenotypes unknown Rat/mice model organisms in physiology and diseases Relevant.
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu Viral.
Yasuhiro Fujiwara (NTT Cyber Space Labs)
Model-based species identification using DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Ion Măndoiu and Sotirios.
METHODS FOR HAPLOTYPE RECONSTRUCTION
HMM II: Parameter Estimation. Reminder: Hidden Markov Model Markov Chain transition probabilities: p(S i+1 = t|S i = s) = a st Emission probabilities:
Multiple Comparisons Measures of LD Jess Paulus, ScD January 29, 2013.
High resolution detection of IBD Sharon R Browning and Brian L Browning Supported by the Marsden Fund.
Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data Gabriel Ilie 1, Alex Zelikovsky 2 and Ion Măndoiu 1 1 CSE Department,
University of Connecticut
Profiles for Sequences
GS 540 week 6. HMM basics Given a sequence, and state parameters: – Each possible path through the states has a certain probability of emitting the sequence.
Ronnie A. Sebro Haplotype reconstruction BMI /21/2004.
Efficient Algorithms for SNP Genotype Data Analysis using Hidden Markov Models of Haplotype Diversity Justin Kennedy Dissertation Defense for the Degree.
Computational Challenges in Whole-Genome Association Studies Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Lecture 5: Learning models using EM
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Ion Mandoiu CSE Department, University of Connecticut Joint work with Justin.
Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads Justin Kennedy 1 Joint work with Sanjiv Dinakar 1, Yozen.
Combinatorial Algorithms for Maximum Likelihood Tag SNP Selection and Haplotype Inference Ion Mandoiu University of Connecticut CS&E Department.
A Comparison of Algorithms for Species Identification based on DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Alexander.
ISBRA 2007 Tutorial A: Scalable Algorithms for Genotype and Haplotype Analysis Ion Mandoiu (University of Connecticut) Alexander Zelikovsky (Georgia State.
Efficient Estimation of Emission Probabilities in profile HMM By Virpi Ahola et al Reviewed By Alok Datar.
Optimal Tag SNP Selection for Haplotype Reconstruction Jin Jun and Ion Mandoiu Computer Science & Engineering Department University of Connecticut.
. Learning Parameters of Hidden Markov Models Prepared by Dan Geiger.
Scalable Algorithms for Analysis of Genomic Diversity Data Bogdan Paşaniuc Department of Computer Science & Engineering University of Connecticut.
Imputation-based local ancestry inference in admixed populations Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint.
. Parameter Estimation For HMM Lecture #7 Background Readings: Chapter 3.3 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004.
HMM - Basics.
Case(Control)-Free Multi-SNP Combinations in Case-Control Studies Dumitru Brinza and Alexander Zelikovsky Combinatorial Search (CS) for Disease-Association:
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
An Efficient Method of Generating Whole Genome Sequence for Thousands of Bulls Chuanyu Sun 1 and Paul M. VanRaden 2 1 National Association of Animal Breeders,
Hidden Markov Models Usman Roshan CS 675 Machine Learning.
Calculation of IBD State Probabilities Gonçalo Abecasis University of Michigan.
Laxman Yetukuri T : Modeling of Proteomics Data
Mining Reference Tables for Automatic Text Segmentation Eugene Agichtein Columbia University Venkatesh Ganti Microsoft Research.
Bayesian MCMC QTL mapping in outbred mice Andrew Morris, Binnaz Yalcin, Jan Fullerton, Angela Meesaq, Rob Deacon, Nick Rawlins and Jonathan Flint Wellcome.
National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department.
Lecture 19: Association Studies II Date: 10/29/02  Finish case-control  TDT  Relative Risk.
Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and.
Lecture 13: Linkage Analysis VI Date: 10/08/02  Complex models  Pedigrees  Elston-Stewart Algorithm  Lander-Green Algorithm.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Mining Logs Files for Data-Driven System Management Advisor.
1 CONTEXT DEPENDENT CLASSIFICATION  Remember: Bayes rule  Here: The class to which a feature vector belongs depends on:  Its own value  The values.
Linear Reduction Method for Tag SNPs Selection Jingwu He Alex Zelikovsky.
California Pacific Medical Center
Errors in Genetic Data Gonçalo Abecasis. Errors in Genetic Data Pedigree Errors Genotyping Errors Phenotyping Errors.
Imputation-based local ancestry inference in admixed populations
Practical With Merlin Gonçalo Abecasis. MERLIN Website Reference FAQ Source.
HMM vs. Maximum Entropy for SU Detection Yang Liu 04/27/2004.
Biostatistics-Lecture 19 Linkage Disequilibrium and SNP detection
Linkage Disequilibrium and Recent Studies of Haplotypes and SNPs
(H)MMs in gene prediction and similarity searches.
1 Systematic Data Selection to Mine Concept-Drifting Data Streams Wei Fan Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery.
Meiotic gene conversion in humans: rate, sex ratio, and GC bias Amy L. Williams June 19, 2013 University of Chicago.
Hidden Markov Model Parameter Estimation BMI/CS 576 Colin Dewey Fall 2015.
Hidden Markov Models. A Hidden Markov Model consists of 1.A sequence of states {X t |t  T } = {X 1, X 2,..., X T }, and 2.A sequence of observations.
National Taiwan University Department of Computer Science and Information Engineering An Approximation Algorithm for Haplotype Inference by Maximum Parsimony.
International Workshop on Bioinformatics Research and Applications, May 2005 Phasing and Missing data recovery in Family Trios D. Brinza J. He W. Mao A.
ICCABS 2013 kGEM: An EM-based Algorithm for Local Reconstruction of Viral Quasispecies Alexander Artyomenko.
Hidden Markov Models BMI/CS 576
Gonçalo Abecasis and Janis Wigginton University of Michigan, Ann Arbor
Constrained Hidden Markov Models for Population-based Haplotyping
Imputation-based local ancestry inference in admixed populations
Hidden Markov Model LR Rabiner
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
IBD Estimation in Pedigrees
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Presentation transcript:

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Justin Kennedy, Ion Mandoiu, Bogdan Pasaniuc CSE Department, University of Connecticut

Outline Introduction Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion

Genotyping Errors A real problem despite advances in genotyping technology [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times Error types Systematic errors (e.g., assay failure) detected by departure from HWE [Hosking et al. 2004] For pedigree data some errors detected as Mendelian Inconsistencies (MIs) Undetected errors E.g., if mother/father/child are all heterozygous, any error is Mendelian consistent Only ~30% detectable as MIs for trios [Gordon et al. 1999]

Effects of Undetected Genotyping Errors Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based) Errors as low as.1% can increase Type I error rates in haplotype sharing transmission disequilibrium test (HS- TDT) [Knapp&Becker04] 1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

Related Work Improved genotype calling algorithms [Di et al. 05, Rabbee&Speed 06, Nicolae et al. 06] Explicit modeling in analysis methods [Kruglyak et al 96, Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02] Computationally complex Separate error detection step [Douglas et al. 00, Becker et al. 06] Detected errors can be retyped, imputed, or ignored in downstream analyses

Outline Introduction Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion

Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] Mother Father Child Likelihood of best phasing for original trio T h h h h h h 4

Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] Mother Father Child Likelihood of best phasing for original trio T ? h’ h’ h’ h’ h’ h’ 4 Likelihood of best phasing for modified trio T’

Likelihood Sensitivity Approach to Error Detection [Becker et al. 06] Mother Father Child ?  Large change in likelihood suggests likely error  Flag genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., R=10 4 )

Implementation in FAMHAP [Becker et al. 06] Window-based algorithm For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50) Find most likely trio phasings by pruned search over the H 4 quadruples of frequent haplotypes Flag genotype as an error if L(T’)/L(T) > R for at least one window Mother … Father … Child …

Limitations of FAMHAP Implementation Truncating the list of haplotypes to size H may lead to sub- optimal phasings and inaccurate L(T) values False positives caused by nearby errors (due to the use of multiple short windows) Our approach: HMM model of haplotype diversity  all haplotypes are represented + no need for short windows Alternate likelihood functions  scalable runtime

Outline Introduction Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion

HMM Model Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05] Block-free model, paths with high transition probability correspond to “founder” haplotypes (Figure from Rastas et al. 07)

HMM Training Previous works use EM training of HMM based on unrelated genotype data Our 2-step algorithm exploits pedigree info Step 1: Infer haplotypes using pedigree-aware algorithm based on entropy-minimization Step 2: train HMM based on inferred haplotypes, using Baum- Welch

Alternate Likelihood Functions Maximum phasing probability is hard to compute We use alternate likelihood functions that are monotonic under data deletion & efficiently computable: Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4- tuples of paths

For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi’s algorithm in time K 3 speed-up by factoring common terms: Efficient Computation of Viterbi Probability = maximum probability of emitting SNP genotypes at locus j+1 from states  = transition probability Where:

Viterbi probability Likelihoods of all 3N modified trios can be computed within time using forward-backward algorithm Overall runtime for M trios Probability of Viterbi haplotypes Obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithms Overall runtime Total trio probability Similar pre-computation speed-up & forward-backward algorithm Overall runtime Overall Runtimes

Outline Introduction Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion

Datasets Real dataset [Becker et al. 2006] 35 SNPs per individual genotype sequence 551 trios Synthetic datasets 35 SNPs, trios Preserved missing data pattern of real dataset Haplotypes assigned to trios based on frequencies inferred from real dataset 1% error rate, four error insertion models Random allele Random genotype Heterozygous-to-homozygous Homozygous-to-heterozygous

Experimental Setup Two strategies for handling MIs Set all three individuals to unknown prior to error detection, or Set child only to unknown (preserving parents’ original data) Two testing strategies Test one SNP genotype: ViterbiProb-1, ViterbiHaps-1, TotalProb-1 Simultaneously test three SNP genotypes at the same locus: ViterbiProb-3, ViterbiHaps-3, TotalProb-3

Comparison with FAMHAP (Random Allele Errors)

Children vs. Parents (Random Allele Errors)

Error Model Comparison (TrioProb-1 Parents)

Error Model Comparison (TrioProb-1 Children)

Unrelated vs. Trio Likelihood Sensitivity Unrelated ViterbiProb-1 Likelihood ratios (children) Trio ViterbiProb-1 Likelihood ratios (children)

Pedigree Info vs. Sample Size Effect

TrioProb-1 Results on Real Dataset [Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3 23 SNP genotypes were identified as true errors 41*3-23=100 resequenced SNP genotypes agree with original calls Predictive value for R=10 4 is between 18/26=69% and 24/26=92%, compared to 23/41=56% for FAMHAP-3

Outline Introduction Likelihood Sensitivity Approach to Error Detection HMM-Based Algorithms Experimental Results Conclusion

We have proposed efficient methods for error detection in trio genotype data based on a HMM model of haplotype diversity Exploiting pedigree info is very useful Improved detection accuracy compared to FAMHAP Runtime linear in #SNPs and #trios Ongoing work Improve detection accuracy via iteration Fix MIs using likelihood before error detection Correct errors with high likelihood ratio, then recompute likelihood ratios (possibly after re-phasing and HMM re-training) Integration with genotype calling algorithms Combine low level intensity data with haplotype-based likelihoods

Questions?

Accuracy Measures Sensitivity TP/(TP+TF) Predictive value TP/(TP+FP) Specificity TN/(FP+TN) False Positive rate 1-Specificity