University of Connecticut

Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology
Ion Mandoiu, University of Connecticut

Outline
HMM model of haplotype diversity
Applications: phasing, error detection, imputation, genotype calling from low-coverage sequencing data
Conclusions

Single Nucleotide Polymorphisms
Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
High density in the human genome: $\approx 1 \times 10^7$ SNPs out of total $3 \times 10^9$ base pairs
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …

Haplotypes and Genotypes
Diploids: two homologous copies of each autosomal chromosome
One inherited from mother and one from father
Haplotype: description of SNP alleles on a chromosome
0/1 vector: 0 for major allele, 1 for minor
Genotype: description of alleles on both chromosomes
0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles
Two haplotypes per individual genotype: 011100110 + 001000010 = 021200210
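The 0/1/2 encoding can be illustrated with a small Python sketch that combines two haplotypes into a genotype. The function name is hypothetical; only the encoding itself comes from the slide.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    out = []
    for a, b in zip(h1, h2):
        if a != b:
            out.append("2")  # heterozygous: the chromosomes carry different alleles
        else:
            out.append(a)    # homozygous: 0 = major allele, 1 = minor allele
    return "".join(out)


# The example from the slide.
print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```

Going in the other direction (from a genotype back to a haplotype pair) is ambiguous whenever the genotype has more than one heterozygous site, which is what makes phasing a nontrivial problem.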

Sources of Haplotype Diversity: Mutation The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.

Sources of Haplotype Diversity: Recombination

Haplotype Structure in Human Populations

HMM Model of Haplotype Frequencies
Model structure: hidden states F1, F2, …, Fn form a chain, and each Fi emits Hi
Fi = founder haplotype at locus i, Hi = observed allele at locus i
P(Fi), P(Fi | Fi-1) and P(Hi | Fi) estimated from reference genotype or haplotype data
For given haplotype h, P(H=h|M) can be computed in $O(nK^2)$ using the forward algorithm, where K is the number of founder haplotypes
Similar models proposed in [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]
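A minimal sketch of the forward computation of P(H=h|M), assuming the model is stored as plain Python lists. The parameter layout (`init`, `trans`, `emit`) and the toy numbers are assumptions made for illustration, not taken from the talk.

```python
from itertools import product


def haplotype_probability(h, init, trans, emit):
    """Forward algorithm for P(H = h | M) in O(n K^2) time.

    h        : list of 0/1 alleles, one per locus (length n)
    init[k]  : P(F_1 = k) for founder k (length K)
    trans[i] : K x K matrix, trans[i][j][k] = P(F_{i+2} = k | F_{i+1} = j)
    emit[i]  : K x 2 matrix, emit[i][k][a] = P(H_{i+1} = a | F_{i+1} = k)
    """
    K = len(init)
    # alpha[k] = P(h_1..h_i, F_i = k)
    alpha = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for i in range(1, len(h)):
        alpha = [
            emit[i][k][h[i]] * sum(alpha[j] * trans[i - 1][j][k] for j in range(K))
            for k in range(K)
        ]
    return sum(alpha)


# Toy model (hypothetical numbers): K = 2 founders, n = 3 loci.
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]]] * 2
emit = [[[0.95, 0.05], [0.1, 0.9]]] * 3

# Sanity check: probabilities of all 2^n haplotypes sum to 1.
total = sum(haplotype_probability(list(h), init, trans, emit)
            for h in product([0, 1], repeat=3))
print(round(total, 12))
```

Each locus costs K^2 multiplications (every founder state at locus i is reached from every founder state at locus i-1), which gives the O(nK^2) bound.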

Genotype Phasing
Which pair of haplotypes explains genotype g?
g: 0010212
h1: 0010111
h2: 0010010
h3: 0010011

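Maximum Likelihood Genotype Phasing
Model structure: two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n emit the haplotype alleles H1, …, Hn and H'1, …, H'n, which together determine the genotypes G1, G2, …, Gn
Maximum likelihood genotype phasing: given g, find $(h_1, h_2) = \arg\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$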
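For intuition, the argmax can be written as an exhaustive search over all haplotype pairs compatible with g. This is a sketch only: it is exponential in the number of heterozygous sites, and the haplotype probability used here is a hypothetical lookup table standing in for P(h|M) from the HMM.

```python
from itertools import product


def compatible_pairs(g):
    """Yield every unordered pair (h1, h2) of 0/1 strings with h1 + h2 = g."""
    het = [i for i, x in enumerate(g) if x == "2"]
    if not het:
        yield g, g  # fully homozygous genotype: a single phasing
        return
    # Fix h1 = 0 at the first heterozygous site so each pair appears once.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for pos, b in zip(het, ("0",) + bits):
            h1[pos] = b
            h2[pos] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


def ml_phasing(g, hap_prob):
    """Return the compatible pair maximizing hap_prob(h1) * hap_prob(h2)."""
    return max(compatible_pairs(g), key=lambda p: hap_prob(p[0]) * hap_prob(p[1]))


# Hypothetical haplotype frequencies; unseen haplotypes get a small floor.
freq = {"0010111": 0.3, "0010010": 0.3, "0010011": 0.2, "0010110": 0.05}
hap_prob = lambda h: freq.get(h, 1e-6)

print(ml_phasing("0010212", hap_prob))
```

With k heterozygous sites there are 2^(k-1) candidate pairs, so practical methods rely on heuristics rather than enumeration.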

Computational Complexity
[KMP08] Cannot approximate $\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$ within a factor of $O(n^{1/2 - \varepsilon})$, unless ZPP=NP
[Rastas et al.] give Viterbi and random sampling based heuristics that yield phasing accuracy comparable to best existing methods (PHASE)

Genotyping Errors
A real problem despite advances in technology & typing algorithms
1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005]
Systematic errors (e.g., assay failure) typically detected by departure from HWE [Hosking et al. 2004]
In pedigrees, some errors detected as Mendelian Inconsistencies (MIs)
Many errors remain undetected
As much as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999]

Likelihood Sensitivity Approach to Error Detection in Trios
Mother: 0 1 2 1 0 2, phased as h1 = 0 1 1 1 0 0 and h2 = 0 1 0 1 0 1
Father: 0 2 2 1 0 2, phased as h3 = 0 0 0 1 0 1 and h4 = 0 1 1 1 0 0
Child: inherits h1 = 0 1 1 1 0 0 and h3 = 0 0 0 1 0 1
Likelihood of best phasing for original trio T

Likelihood Sensitivity Approach to Error Detection in Trios
Mother: 0 1 2 1 0 2, phased as h'1 = 0 1 0 1 0 1 and h'2 = 0 1 1 1 0 0
Father: 0 2 2 1 0 2, phased as h'3 = 0 0 0 1 0 0 and h'4 = 0 1 1 1 0 1
Child: inherits h'1 = 0 1 0 1 0 1 and h'3 = 0 0 0 1 0 0
Likelihood of best phasing for modified trio T' vs. likelihood of best phasing for original trio T

Likelihood Sensitivity Approach to Error Detection in Trios
Mother: 0 1 2 1 0 2
Father: 0 2 2 1 0 2
Child: 0 2 2 1 0 2, with the genotype under test marked "?"
Large change in likelihood suggests likely error
Flag genotype as an error if L(T')/L(T) > R, where R is the detection threshold (e.g., $R = 10^4$)
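The flagging rule can be sketched as a loop over every genotype in the trio. Two things here are assumptions rather than content from the slides: the modified trio T' is built by substituting the best alternative genotype value at the tested position, and `toy_likelihood` is a crude stand-in (it only penalizes Mendelian inconsistencies) for the HMM-based likelihood L(T).

```python
ALLELES = {0: {0}, 1: {1}, 2: {0, 1}}  # alleles carried under the 0/1/2 encoding


def toy_likelihood(trio):
    """Hypothetical stand-in for L(T): penalize Mendelian-inconsistent loci."""
    p = 1.0
    for m, f, c in zip(trio["mother"], trio["father"], trio["child"]):
        possible = {a if a == b else 2 for a in ALLELES[m] for b in ALLELES[f]}
        p *= 1.0 if c in possible else 1e-6
    return p


def flag_errors(trio, likelihood, R=1e4):
    """Return (member, locus) pairs whose modification raises L by more than R."""
    base = likelihood(trio)
    flagged = []
    for member, genos in trio.items():
        for i, original in enumerate(genos):
            best_alt = 0.0
            for alt in (0, 1, 2):
                if alt == original:
                    continue
                modified = {m: list(gs) for m, gs in trio.items()}
                modified[member][i] = alt
                best_alt = max(best_alt, likelihood(modified))
            # Equivalent to L(T') / L(T) > R, without dividing by zero.
            if best_alt > R * base:
                flagged.append((member, i))
    return flagged


trio = {"mother": [0, 1], "father": [0, 1], "child": [0, 0]}
print(flag_errors(trio, toy_likelihood))
```

With an HMM-based likelihood in place of the toy function, the same loop can also flag Mendelian-consistent errors, since a genotype that breaks a common haplotype pattern lowers L(T) even when it does not violate inheritance rules.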

Alternate Likelihood Functions
[KMP08] Cannot approximate L(T) within $O(n^{1/4 - \varepsilon})$, unless ZPP=NP
Efficiently computable likelihood functions:
Viterbi probability
Probability of Viterbi haplotypes
Total trio probability

Comparison of Alternative Likelihood Functions (1% Random Allele Errors)

Log-Likelihood Ratio Distribution FPs caused by same-locus errors in parents

“Combined” Detection Method
Compute 4 likelihood ratios:
Trio
Mother-child duo
Father-child duo
Child (unrelated)
Flag as error if all ratios are above detection threshold
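The combined rule reduces to a conjunction over the four ratios. A minimal sketch, with hypothetical key names and ratio values:

```python
def combined_flag(ratios, R=1e4):
    """Flag a genotype only if every likelihood ratio exceeds the threshold R.

    ratios: mapping from context name to likelihood ratio, e.g. the keys
    'trio', 'mother_child', 'father_child', 'child' (names are illustrative).
    """
    return all(r > R for r in ratios.values())


# Hypothetical ratios for one child genotype.
ratios = {"trio": 3e6, "mother_child": 8e4, "father_child": 2e5, "child": 5e2}
print(combined_flag(ratios))  # False: the unrelated-child ratio is below R
```

Requiring agreement across all four contexts trades some sensitivity for fewer false positives, for example when the real error sits in a parent at the same locus.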

Comparison with FAMHAP (Children)

Comparison with FAMHAP (Parents)

Genome-Wide Association Studies
Powerful method for finding genes associated with complex human diseases
Large number of markers (SNPs) typed in cases and controls
Disease causal SNPs unlikely to be typed directly
Significant statistical power gained by performing imputation of untyped HapMap genotypes [WTCCC’07]

HMM Based Genotype Imputation
Train the HMM using haplotypes from a related HapMap panel or a small cohort typed at high density
Compute the probability of each missing genotype given the typed genotype data
gi is imputed as the genotype value with the highest such probability

Experimental Results Estimates of the allele 0 frequency based on Imputation vs. Illumina 15k

Experimental Results Accuracy and missing data rate for imputed genotypes at different thresholds

Ultra-High Throughput Sequencing
New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing
-SBS: Sequencing by Synthesis
-SBL: Sequencing by Ligation
-Challenges in genome assembly: short read lengths and the absence of paired ends make it difficult for assembly software to disambiguate repeat regions, resulting in fragmented assemblies.
-New types of sequencing error: for 454, these include incorrect estimates of homopolymer lengths, ‘transposition-like’ insertions (a base identical to a nearby homopolymer is inserted in a nearby nonadjacent location), and errors caused by multiple templates attached to the same bead.
Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400bp reads
Illumina / Solexa Genome Analyzer 1G: 1000 Mb/run, 35bp reads
Applied Biosystems SOLiD: 3000 Mb/run, 25-35bp reads

Probabilistic Model
Model structure: two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n emit the haplotype alleles Hi and H'i, which determine the genotype Gi at each locus i; each Gi in turn generates the reads Ri,1, …, Ri,ci covering that locus.
The hierarchical-factorial HMM we use to describe haplotype sequences is similar to models recently proposed by others. At the core of the model are two regular HMMs representing haplotype frequencies in the populations of origin of the sequenced individual’s parents. Under this model, each haplotype in the population is viewed as a mosaic formed by historical recombination among a set of K founder haplotypes. Increasing the number of founders allows more paths to describe a haplotype sequence, at the cost of a longer run time. The HMMs are trained with the Baum-Welch algorithm on haplotypes inferred from a panel representing the population of origin of each parent.

Model Training
Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) are trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
P(gi|hi,h’i) is set to 1 if hi+h’i=gi and to 0 otherwise
Conditional probabilities for sets of reads are given in terms of the probability that read r has an error at locus i

Multilocus Genotyping Problem
GIVEN:
Shotgun read sets r=(r1, r2, … , rn)
Base quality scores
HMMs for populations of origin for mother/father
FIND:
Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., $g^* = \arg\max_g P(g \mid r)$
NOTE: computing P(g|r) is NP-hard

Posterior Decoding Algorithm
For each i = 1..n, compute $g^*_i = \arg\max_{g_i} P(g_i \mid r) = \arg\max_{g_i} P(g_i, r)$
Return $g^* = (g^*_1, g^*_2, \ldots, g^*_n)$
Joint probabilities $P(g_i, r)$ can be computed using a forward-backward algorithm
Direct implementation gives $O(m + nK^4)$ time, where m = number of reads, n = number of SNPs, K = number of founder haplotypes in HMMs
Runtime reduced to $O(m + nK^3)$ using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]
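Posterior decoding picks, at each position, the state with the highest joint probability with the observations. The sketch below shows the idea on an ordinary single-chain HMM, which is a deliberate simplification of the hierarchical-factorial model used in the talk; the parameter layout and toy numbers are assumptions.

```python
def joint_probabilities(obs, init, trans, emit):
    """Forward-backward: joint[i][k] = P(state_i = k, all observations)."""
    n, K = len(obs), len(init)
    fwd = [[0.0] * K for _ in range(n)]
    bwd = [[1.0] * K for _ in range(n)]
    for k in range(K):
        fwd[0][k] = init[k] * emit[k][obs[0]]
    for i in range(1, n):
        for k in range(K):
            fwd[i][k] = emit[k][obs[i]] * sum(
                fwd[i - 1][j] * trans[j][k] for j in range(K))
    for i in range(n - 2, -1, -1):
        for j in range(K):
            bwd[i][j] = sum(
                trans[j][k] * emit[k][obs[i + 1]] * bwd[i + 1][k] for k in range(K))
    return [[fwd[i][k] * bwd[i][k] for k in range(K)] for i in range(n)]


def posterior_decode(obs, init, trans, emit):
    """Pick the most probable state independently at each position."""
    joint = joint_probabilities(obs, init, trans, emit)
    return [max(range(len(init)), key=lambda k: row[k]) for row in joint]


# Toy model (hypothetical): 2 sticky states, observations correct 90% of the time.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
obs = [0, 0, 1, 0, 0]  # the isolated 1 looks like a noisy observation

print(posterior_decode(obs, init, trans, emit))
```

In the toy run the neighbouring observations outweigh the single discordant one. Correlation between nearby loci plays the same role when genotypes are called from low-coverage reads.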

Genotyping Accuracy on Watson Reads

Conclusions
HMM model of haplotype diversity provides a powerful framework for addressing central problems in population genetics & genetic epidemiology
Enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations
Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice
Highly scalable runtime (linear in #SNPs and #individuals/reads)
Software available at http://www.engr.uconn.edu/~ion/SOFT/

Acknowledgements Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc NSF funding (awards IIS-0546457 and DBI-0543365)