

Viral Quasispecies Reconstruction Based on Unassembled Frequency Estimation
Alex Zelikovsky, Department of Computer Science, Georgia State University
Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu

Outline
- Introduction
- ML Model
- EM Algorithm
- VSEM Algorithm
- Experimental Results
- Conclusions and future work
ISBRA 2011, Central South University, Changsha, China

454 Pyrosequencing
- Emulsion PCR
- Single nucleotide addition
  - Natural nucleotides
  - DNA polymerase pauses until the complementary nucleotide is dispensed
  - Nucleotide incorporation triggers an enzymatic reaction that results in emission of light

ML Model
- Panel: bipartite graph
  - RIGHT: strings
    - unknown frequencies
  - LEFT: reads
    - observed frequencies
  - EDGES: probability of the read being emitted by the string
    - weights are calculated from the mapping of the reads to the strings
[Figure: bipartite panel with strings S1, S2, S3 and reads R1, R2, R3, R4]

ML estimates of string frequencies
- The probability that a read is sampled from string j is proportional to its frequency f(j)
- The ML estimate of f(j) is n(j) / (n(1) + ... + n(N))
  - n(j): number of reads sampled from string j

EM algorithm
- E-step: compute the expected number n(j) of reads that come from string j, assuming the current string frequencies f(j) are correct
- M-step: for each string j, set the new value of f(j) to the fraction of all observed reads that originate from string j
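The two steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the panel is assumed to be given as a matrix `weights[j][i]` (weight of the edge from string j to read i) together with observed read counts, and the function name, the uniform starting point, and the stopping tolerance are all choices made for this example.

```python
def em_string_frequencies(weights, read_counts, max_iter=500, tol=1e-12):
    """Estimate string frequencies f(j) by EM on a bipartite panel.

    weights[j][i]  -- assumed edge weight: probability that string j emits read i
    read_counts[i] -- observed number of copies of read i
    """
    n_strings = len(weights)
    f = [1.0 / n_strings] * n_strings  # assumed uniform start
    for _ in range(max_iter):
        # E-step: expected number n(j) of reads coming from string j,
        # assuming the current frequencies f are correct.
        n = [0.0] * n_strings
        for i, count in enumerate(read_counts):
            denom = sum(f[j] * weights[j][i] for j in range(n_strings))
            if denom == 0.0:
                continue  # read is not explained by any string in the panel
            for j in range(n_strings):
                n[j] += count * f[j] * weights[j][i] / denom
        assigned = sum(n)
        if assigned == 0.0:
            break
        # M-step: new f(j) is the share of (explained) reads assigned to j.
        new_f = [x / assigned for x in n]
        converged = max(abs(a - b) for a, b in zip(f, new_f)) < tol
        f = new_f
        if converged:
            break
    return f


if __name__ == "__main__":
    # Two strings, each emitting one distinct read; counts 3 and 1.
    print(em_string_frequencies([[1.0, 0.0], [0.0, 1.0]], [3, 1]))
```

When every read is unambiguous, as in the usage example, EM reduces to the counting estimate n(j) / (n(1) + ... + n(N)); the iteration only matters when reads are shared between strings.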

ML Model Quality
- How well does the maximum likelihood model explain the reads?
- Measured by the deviation between expected and observed read frequencies
  - expected read frequency: [formula not preserved in the transcript]
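Since the slide's formula did not survive, the following Python sketch shows one plausible way to compute such a deviation. Both definitions are assumptions made for illustration: the expected frequency of a read is taken as the frequency-weighted sum of each string's normalized emission weights, and the deviation as the mean absolute difference between observed and expected frequencies. The function names are hypothetical.

```python
def expected_read_frequencies(weights, f):
    """Expected frequency of each read under string frequencies f.

    Assumption: string j emits read i with probability
    weights[j][i] / sum(weights[j]), so the expected frequency of read i
    is the sum over j of f[j] times that probability.
    """
    expected = [0.0] * len(weights[0])
    for j, row in enumerate(weights):
        row_sum = sum(row)
        if row_sum == 0.0:
            continue  # string emits nothing
        for i, w in enumerate(row):
            expected[i] += f[j] * w / row_sum
    return expected


def deviation(observed, expected):
    """Assumed measure: mean absolute difference over all reads."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


if __name__ == "__main__":
    panel = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
    e = expected_read_frequencies(panel, [0.5, 0.5])
    print(deviation([0.25, 0.25, 0.25, 0.25], e))
    print(deviation([0.40, 0.10, 0.25, 0.25], e))
```

A deviation near zero means the panel explains the observed read spectrum; a clearly positive value suggests that some reads come from strings that are not in the panel.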

VSEM: Virtual String EM
[Flowchart: (incomplete) panel + virtual string with 0-weights in virtual string -> EM -> ML estimates of string frequencies -> compute expected read frequencies -> deviation between expected/observed read frequencies -> stop condition; no: update weights of reads in virtual string and run EM again; yes: output string frequencies, reads]
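The loop in the flowchart can be sketched as follows in Python. This is an illustrative reading of the diagram, not the published algorithm: the transcript does not preserve the weight update rule or the stop condition, so the sketch assumes that the virtual string's weight for a read grows by the amount the read is under-explained (observed minus expected frequency, clipped at zero) and that iteration stops when the mean absolute deviation falls below a threshold. All function and parameter names are hypothetical.

```python
def em(weights, counts, iters=500):
    """Plain EM for string frequencies on a bipartite panel."""
    m = len(weights)
    f = [1.0 / m] * m
    for _ in range(iters):
        n = [0.0] * m
        for i, c in enumerate(counts):
            denom = sum(f[j] * weights[j][i] for j in range(m))
            if denom > 0.0:
                for j in range(m):
                    n[j] += c * f[j] * weights[j][i] / denom
        total = sum(n)
        if total == 0.0:
            break
        f = [x / total for x in n]
    return f


def expected_frequencies(weights, f):
    """Assumed model: each string emits reads in proportion to its weights."""
    e = [0.0] * len(weights[0])
    for j, row in enumerate(weights):
        s = sum(row)
        if s > 0.0:
            for i, w in enumerate(row):
                e[i] += f[j] * w / s
    return e


def vsem(panel, counts, eps=1e-3, max_rounds=100):
    """Return (panel string frequencies, virtual string frequency, virtual weights)."""
    total = sum(counts)
    observed = [c / total for c in counts]
    virtual = [0.0] * len(counts)  # virtual string starts with 0-weights
    f = []
    for _ in range(max_rounds):
        weights = panel + [virtual]
        f = em(weights, counts)
        e = expected_frequencies(weights, f)
        d = sum(abs(o - x) for o, x in zip(observed, e)) / len(counts)
        if d < eps:  # assumed stop condition
            break
        # Assumed update: raise the virtual weight of under-explained reads.
        virtual = [max(0.0, v + (o - x)) for v, o, x in zip(virtual, observed, e)]
    return f[:-1], f[-1], virtual


if __name__ == "__main__":
    # One known string explains reads 0 and 1; read 2 has no source in the panel.
    print(vsem([[1.0, 1.0, 0.0]], [40, 40, 20]))
```

In the usage example the virtual string absorbs the unexplained read, so its estimated frequency plays the role of the total frequency of missing strings, and its nonzero weights mark the reads attributed to them.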

Example: 1st iteration
[Figure sequence: a full panel (strings S1, S2, S3; reads R1, R2, R3, R4) shown next to an incomplete panel (strings S1, S2 and a virtual string VS; reads R1, R2, R3, R4). Successive slides add observed (O) and expected (E) read frequencies, ML string frequency estimates, and the deviation D. Values shown: D=0 and D=.08, then D=0 and D=.075.]

Example: last iteration
[Figure: the same full and incomplete panels with O, E, and ML values; value shown: D=0.]

VSEM: Virtual String EM
- Decide if the panel is likely to be incomplete
- Estimate the total frequency of missing strings
- Identify the read spectrum emitted by missing strings

ViSpA
- ViSpA [Astrovskaya et al. 2011]: viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyrosequencing shotgun reads
  - align reads
  - build a read graph:
    - V: reads
    - E: overlaps between reads
    - each path: candidate sequence
  - filter based on ML frequencies
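The read-graph idea can be illustrated with a toy Python sketch. It is not ViSpA's implementation: it assumes error-free reads already aligned to a reference and given as hypothetical `(start, sequence)` pairs, treats two reads as overlapping only when the second starts inside the first, extends past it, and agrees exactly on the shared positions, and enumerates every source-to-sink path, which is exponential in the worst case.

```python
def overlaps(a, b):
    """True if aligned read b starts inside a, ends after a, and agrees on the overlap."""
    (sa, qa), (sb, qb) = a, b
    ea, eb = sa + len(qa), sb + len(qb)
    if not (sa < sb < ea < eb):
        return False
    return qa[sb - sa:] == qb[:ea - sb]


def build_read_graph(reads):
    """V = read indices, E = consistent overlaps (a DAG, since starts increase)."""
    return {
        u: [v for v in range(len(reads)) if overlaps(reads[u], reads[v])]
        for u in range(len(reads))
    }


def candidate_sequences(reads):
    """Spell out the sequence of every path from a source to a sink."""
    graph = build_read_graph(reads)
    has_predecessor = {v for targets in graph.values() for v in targets}
    sources = [u for u in graph if u not in has_predecessor]
    results = set()

    def extend(u, seq, end):
        if not graph[u]:  # sink: the path is a complete candidate
            results.add(seq)
            return
        for v in graph[u]:
            start_v, seq_v = reads[v]
            # Append only the part of read v that extends past the current end.
            extend(v, seq + seq_v[end - start_v:], start_v + len(seq_v))

    for s in sources:
        start, seq = reads[s]
        extend(s, seq, start + len(seq))
    return sorted(results)


if __name__ == "__main__":
    reads = [(0, "ACGT"), (2, "GTAA"), (2, "GTCC")]
    print(candidate_sequences(reads))
```

The candidates produced this way would then be the strings of the panel whose frequencies are estimated and filtered by ML.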

ViSpA-VSEM
[Flowchart with components: reads; ViSpA weighted assembler; assembled Qsps; Qsps library; ViSpA ML estimator; VSEM (Virtual String EM); reads, weights; removing duplicated & rare qsps; stopping condition (YES/NO); output: viral spectrum + statistics]

Simulation Setup and Accuracy Measures
- Real quasispecies sequence data from [von Hahn et al. 2006]
  - 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus
  - Error-free data was simulated with an in-house simulator
    - population sizes: 10, 20, 30, and 40 sequences
    - population distributions: geometric, skewed normal, uniform
- Accuracy measures
  - Kullback-Leibler divergence
  - Correlation between real and predicted frequencies
  - Average prediction error
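The three accuracy measures can be sketched in plain Python. The slide does not give exact definitions, so this example assumes the standard Kullback-Leibler divergence with natural logarithm, the Pearson correlation coefficient, and the mean absolute difference as the "average prediction error"; the function names and the small floor used to avoid division by zero are illustrative choices.

```python
import math


def kl_divergence(true_freqs, predicted_freqs, floor=1e-12):
    """KL(true || predicted); predicted values are floored to avoid log(0)."""
    return sum(
        p * math.log(p / max(q, floor))
        for p, q in zip(true_freqs, predicted_freqs)
        if p > 0.0
    )


def pearson_correlation(x, y):
    """Pearson correlation; NaN when either vector is constant (e.g. uniform)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def average_prediction_error(true_freqs, predicted_freqs):
    """Assumed definition: mean absolute difference per sequence."""
    diffs = [abs(t - p) for t, p in zip(true_freqs, predicted_freqs)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    real = [0.5, 0.3, 0.2]
    predicted = [0.45, 0.35, 0.2]
    print(kl_divergence(real, predicted))
    print(pearson_correlation(real, predicted))
    print(average_prediction_error(real, predicted))
```

Note that correlation is undefined for a perfectly uniform population, which is why the sketch returns NaN in that case rather than raising an error.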

Experimental Validation of VSEM
- Detection of panel incompleteness
  - VSEM can detect 1% of missing strings
- Improving quasispecies frequencies
- Detection of reads emitted by missing strings
  - Correlation between predicted reads and reads emitted by missing strings >65%

EM vs VSEM
[Table: r and err by % of missing strings (<10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, >50%) for ViSpA and ViSpA-VSEM at r.l./n.r settings 100/20K, 300/20K, 100/100K, and 300/100K; numeric values not preserved in the transcript]

ViSpA vs ViSpA-VSEM
[Table: PPV, Sensitivity, RE, r, and err for ViSpA and for ViSpA-VSEM, plus Gain, by population distribution (Geometric, Skewed, Uniform); K reads from 10 QSPS, average length 300; numeric values not preserved in the transcript]

ViSpA vs ViSpA-VSEM
[Table: PPV, Sensitivity, RE, r, and err for ViSpA and for ViSpA-VSEM, plus Gain, by #mismatches (four settings of k; the values of k are not preserved); 100K reads from 10 QSPS, average length 300; numeric values not preserved in the transcript]

Conclusions & Future Work
- Apply VSEM to RNA-Seq data
- Assemble missing strings from the set of reads emitted by missing strings
- Handle chimeric strings present in the panel

Acknowledgments NFS …

Thank you very much