KGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data Alexander Artyomenko.

Slides:



Advertisements
Similar presentations
Marius Nicolae Computer Science and Engineering Department
Advertisements

RNA-Seq based discovery and reconstruction of unannotated transcripts
Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data Bassam Tork Georgia State University Atlanta, GA 30303, USA.
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu Viral.
Hidden Markov Model in Biological Sequence Analysis – Part 2
Model-based species identification using DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Ion Măndoiu and Sotirios.
HMM II: Parameter Estimation. Reminder: Hidden Markov Model Markov Chain transition probabilities: p(S i+1 = t|S i = s) = a st Emission probabilities:
Learning HMM parameters
Part of Speech Tagging The DT students NN went VB to P class NN Plays VB NN well ADV NN with P others NN DT Fruit NN flies NN VB NN VB like VB P VB a DT.
Amplicon-Based Quasipecies Assembly Using Next Generation Sequencing Nick Mancuso Bassam Tork Computer Science Department Georgia State University.
A New Nonparametric Bayesian Model for Genetic Recombination in Open Ancestral Space Presented by Chunping Wang Machine Learning Group, Duke University.
. Learning – EM in ABO locus Tutorial #08 © Ydo Wexler & Dan Geiger.
. Learning – EM in The ABO locus Tutorial #8 © Ilan Gronau. Based on original slides of Ydo Wexler & Dan Geiger.
HMM for CpG Islands Parameter Estimation For HMM Maximum Likelihood and the Information Inequality Lecture #7 Background Readings: Chapter 3.3 in the.
… Hidden Markov Models Markov assumption: Transition model:
. Parameter Estimation and Relative Entropy Lecture #8 Background Readings: Chapters 3.3, 11.2 in the text book, Biological Sequence Analysis, Durbin et.
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
CpG islands in DNA sequences
. Learning – EM in The ABO locus Tutorial #9 © Ilan Gronau.
1 Nicholas Mancuso Department of Computer Science Georgia State University Joint work with Bassam Tork, GSU Pavel Skums, CDC Ion M ӑ ndoiu, UConn Alex.
Hidden Markov Models Usman Roshan BNFO 601. Hidden Markov Models Alphabet of symbols: Set of states that emit symbols from the alphabet: Set of probabilities.
Optimal Tag SNP Selection for Haplotype Reconstruction Jin Jun and Ion Mandoiu Computer Science & Engineering Department University of Connecticut.
Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky.
Genotyping of James Watson’s genome from Low-coverage Sequencing Data Sanjiv Dinakar and Yözen Hernández.
1 Probability. 2 Probability has three related “meanings.” 1. Probability is a mathematical construct. Probability.
. Learning Parameters of Hidden Markov Models Prepared by Dan Geiger.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Learning HMM parameters Sushmita Roy BMI/CS 576 Oct 21 st, 2014.
. Class 5: Hidden Markov Models. Sequence Models u So far we examined several probabilistic model sequence models u These model, however, assumed that.
Parameter estimate in IBM Models: Ling 572 Fei Xia Week ??
. Learning – EM in The ABO locus Tutorial #9 © Ilan Gronau. Based on original slides of Ydo Wexler & Dan Geiger.
Reconstruction of infectious bronchitis virus quasispecies from 454 pyrosequencing reads CAME 2011 Ion Mandoiu Computer Science & Engineering Dept. University.
. Parameter Estimation For HMM Lecture #7 Background Readings: Chapter 3.3 in the text book, Biological Sequence Analysis, Durbin et al.,  Shlomo.
Software for Robust Transcript Discovery and Quantification from RNA-Seq Ion Mandoiu, Alex Zelikovsky, Serghei Mangul.
Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.
. Parameter Estimation For HMM Lecture #7 Background Readings: Chapter 3.3 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,
Case(Control)-Free Multi-SNP Combinations in Case-Control Studies Dumitru Brinza and Alexander Zelikovsky Combinatorial Search (CS) for Disease-Association:
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Recitation on EM slides taken from:
Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Hidden Markov Models Usman Roshan CS 675 Machine Learning.
Informative SNP Selection Based on Multiple Linear Regression
National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department.
Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and.
Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya.
Linear Reduction Method for Tag SNPs Selection Jingwu He Alex Zelikovsky.
Algorithms in Computational Biology11Department of Mathematics & Computer Science Algorithms in Computational Biology Markov Chains and Hidden Markov Model.
. Parameter Estimation and Relative Entropy Lecture #8 Background Readings: Chapters 3.3, 11.2 in the text book, Biological Sequence Analysis, Durbin et.
California Pacific Medical Center
Bioinformatics tools for viral quasispecies reconstruction from next-generation sequencing data and vaccine optimization PD: Ion Măndoiu, UConn Co-PDs: Mazhar.
Probabilistic Context Free Grammars Grant Schindler 8803-MDM April 27, 2006.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.
A Maximum Likelihood Method for Quasispecies Reconstruction Nicholas Mancuso, Georgia State University Bassam Tork, Georgia State University Pavel Skums,
Coalescent theory CSE280Vineet Bafna Expectation, and deviance Statements such as the ones below can be made only if we have an underlying model that.
Hidden Markov Model Parameter Estimation BMI/CS 576 Colin Dewey Fall 2015.
The Haplotype Blocks Problems Wu Ling-Yun
Inferences on human demographic history using computational Population Genetic models Gabor T. Marth Department of Biology Boston College Chestnut Hill,
Population sequencing using short reads: HIV as a case study Vladimir Jojic et.al. PSB 13: (2008) Presenter: Yong Li.
Monkey Business Bioinformatics Research Center University of Aarhus Thomas Mailund Joint work with Asger Hobolth, Ole F. Christiansen and Mikkel H. Schierup.
ICCABS 2013 kGEM: An EM-based Algorithm for Local Reconstruction of Viral Quasispecies Alexander Artyomenko.
Hidden Markov Models BMI/CS 576
Constrained Hidden Markov Models for Population-based Haplotyping
Alexander Zelikovsky Computer Science Department
Presentation transcript:

kGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data Alexander Artyomenko

Introduction Introduction Reconstructing spectrum of viral population Challenges: – Assembling short reads to span entire genome – Distinguishing sequencing errors from mutations Avoid assembling: – ID sequences via high variability region

Previous Work Previous Work KEC (k-mer Error Correction) [Skums et al.] – Incorporates counts (frequencies) of k-mers (substrings of length k) QuasiRecomb (Quasispecies Recombination) [Töpfer et. al] – Hidden Markov Model-based approach – Incorporates possibility for recombinant progeny – Parameter: k generators (ancestor haplotypes)

Problem Formulation Problem Formulation Given: a set of reads R emitted by a set of unknown haplotypes H’ Find: a set of haplotypes H = {H 1,…,H k } maximizing Pr(R|H)

Fractional Haplotype Fractional Haplotype Fractional Haplotype: a string of 5-tuples of probabilities for each possible symbol: a, c, t, g, d=‘-’ a c - tctgc a c t g d

kGEM kGEM Initialize (fractional) Haplotypes Repeat until Haplotypes are unchanged Estimate Pr(r|H i ) probability of a read r being emitted by haplotype H i Estimate frequencies of Haplotypes Update and Round Haplotypes Collapse Identical and Drop Rare Haplotypes Output Haplotypes

Initialization Initialization Find set of reads representing haplotype population – Start with a random read – Each next read maximizes minimum distance to previously chosen

Initialization Initialization Transform selected reads into fractional haplotypes using formula: where s m is i-th nucleotide of selected read s. a c - tg- g a -c ε=0.01 a c t g d

Read Emission Probability Read Emission Probability For each i=1, …, k and for each read r j from R compute value: Reads Haplotypes h 1,1 h 3,2 h 2,1 h 3,1 h 1,2 h 2,2

Estimate Frequencies Estimate Frequencies Estimate haplotype frequencies via Expectation Maximization (EM) method Repeat two steps until the change < σ E-step: expected portion of r emitted by H i M-step: updated frequency of haplotype H i

Update Haplotypes Update Haplotypes Update allele frequencies for each haplotype according to read’s contribution: a … c t g d

Round each haplotype’s position to most probable allele a … c t g d a … c t g d a … c t g d a … c t g d a … c t g d Round Haplotypes Round Haplotypes ac-tactgc

Collapse and Drop Rare Collapse and Drop Rare Collapse haplotypes which have the same integral strings Drop haplotypes with coverage ≤ δ – Empirically, δ<5 implies drop in PPV without improving sensitivity

kGEM kGEM Initialize (fractional) Haplotypes Repeat until Haplotypes are unchanged Estimate Pr(r|H i ) probability of a read r being emitted by haplotype H i Estimate frequencies of Haplotypes Update and Round Haplotypes Collapse Identical and Drop Rare Haplotypes Output Haplotypes

Experimental Setup Experimental Setup HCV E1E2 sub-region (315bp) 20 simulated data sets of 10 variants 100,000 reads from Grinder datasets with homo-polymer errors Frequency distribution: uniform and power-law model with parameter α= 2.0

Nicholas Mancuso Alex Zelikovsky Pavel Skums Ion M ă ndoiu Acknowledgements Acknowledgements

Thank you! Questions?