RNA-Seq based discovery and reconstruction of unannotated transcripts

Slides:

Advertisements

Similar presentations

RNA-Seq as a Discovery Tool

Advertisements

Marius Nicolae Computer Science and Engineering Department

Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu Viral.

 Experimental Setup  Whole brain RNA-Seq Data from Sanger Institute Mouse Genomes Project [Keane et al. 2011]  Synthetic hybrids with different levels.

Algorithms for Multisample Read Binning

Peter Tsai Bioinformatics Institute, University of Auckland

Amplicon-Based Quasipecies Assembly Using Next Generation Sequencing Nick Mancuso Bassam Tork Computer Science Department Georgia State University.

Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)

Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data Alex Zelikovsky Department of Computer Science Georgia State University Joint work.

Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

Transcriptomics Jim Noonan GENE 760.

RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.

Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.

1 Nicholas Mancuso Department of Computer Science Georgia State University Joint work with Bassam Tork, GSU Pavel Skums, CDC Ion M ӑ ndoiu, UConn Alex.

Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky.

Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.

Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.

mRNA-Seq: methods and applications

Software for Robust Transcript Discovery and Quantification from RNA-Seq Ion Mandoiu, Alex Zelikovsky, Serghei Mangul.

Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.

Bacterial Genome Assembly | Victor Jongeneel Radhika S. Khetani

RNA-Seq and RNA Structure Prediction

Introduction to RNA-Seq and Transcriptome Analysis

LECTURE 2 Splicing graphs / Annoteted transcript expression estimation.

Li and Dewey BMC Bioinformatics 2011, 12:323

Todd J. Treangen, Steven L. Salzberg

Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.

Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,

Introduction to RNA-Seq & Transcriptome Analysis

Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)

Novel transcript reconstruction from ION Torrent sequencing reads and Viral Meta-genome Reconstruction from AmpliSeq Ion Torrent data University of Connecticut.

Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and.

Sahar Al Seesi and Ion Măndoiu Computer Science and Engineering

Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.

From Genomes to Genes Rui Alves.

RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.

Eukaryotic Gene Prediction Rui Alves. How are eukaryotic genes different? DNA RNA Pol mRNA Ryb Protein.

Introduction to RNAseq

California Pacific Medical Center

Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.

TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.

No reference available

Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Adrian Caciula (GSU), Serghei Mangul (UCLA) James Lindsay, Ion.

Manuel Holtgrewe Algorithmic Bioinformatics, Department of Mathematics and Computer Science PMSB Project: RNA-Seq Read Simulation.

An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

KGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data Alexander Artyomenko.

RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.

ICCABS 2013 kGEM: An EM-based Algorithm for Local Reconstruction of Viral Quasispecies Alexander Artyomenko.

RNA-Seq Primer Understanding the RNA-Seq evidence tracks on

RNA Quantitation from RNAseq Data

VCF format: variants c.f. S. Brown NYU

Gene expression from RNA-Seq

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

Kallisto: near-optimal RNA seq quantification tool

Alexander Zelikovsky Computer Science Department

Gene expression estimation from RNA-Seq data

Reference based assembly

Alternative Splicing QTLs in European and African Populations

Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs BMI/CS Spring 2019 Colin Dewey

Dec. 22, 2011 live call UCONN: Ion Mandoiu, Sahar Al Seesi

Quantitative analyses using RNA-seq data

Sequence Analysis - RNA-Seq 2

Schematic representation of a transcriptomic evaluation approach.

Volume 11, Issue 7, Pages (May 2015)

Presentation transcript:

RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (GSU) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU) CAME 2011, Atlanta, GA

Genome-Guided RNA-Seq Protocol Make cDNA & shatter into fragments Sequence fragment ends A B C D E Map reads -transcripts: -> are RNA sequences produced from genes -from RNA -> through the process of hybridization we obtain multiple copies of DNA (i.e., cDNA) -> then the cDNA is partitioned into fragments from where we genotype (or type or ‘read’) either one or both ends of the fragment and map them to know exons (coding regions on genome) Isoform Expression (IE) Gene Expression (GE) A B C D E Isoform Discovery (ID) CAME 2011, Atlanta, GA

Using Partial Annotation Existing tools for genome-guided RNA-seq protocol Cufflinks [Trapnell et al. 2010] reports minimal set of transcripts Scripture [Guttman et al. 2010] reports all possible transcripts Besides genomes, partial annotation exists for many species Almost all human genes/exons and majority of transcripts are already known (genomes libraries - UCSC) How can one incorporate it into genome-guided RNA-seq protocol? Cufflinks – RABT [Roberts et al. 2011] – added fake reads uniformly distributed among transcripts This talk: Subtract reads (exon counts) explained by known transcripts Use Virtual String EM (VSEM) [Mangul et al. 2011] Reconstruct transcripts containing unexplained exons Research group of Cufflinks enhanced their reconstruction tool with RABT assembly feature RABT – Reference Annotation Based Transcript assembly The rest of the presentation focuses on our novel approach for a different way to use Partial Annotations Use VSEM to detect exons from unannotated transcripts and then we reconstruct transcripts CAME 2011, Atlanta, GA

Outline EM for Isoform Expression Estimation Virtual Transcript EM Algorithm DRUT: Detection and Reconstruction of Unannotated Transcripts Experimental Results Conclusions Our method for DRUT based on Virtual Transcript EM Alg. CAME 2011, Atlanta, GA

Max Likelihood Model LEFT: transcripts RIGHT: reads unknown frequencies RIGHT: reads observed frequencies EDGES: weight ~ probability of the read to be emitted by the transcript weights are calculated based on the mapping of the reads to the transcripts Given: annotations (transcripts) and frequencies of the reads Find: ML estimate of transcript frequencies Input data of EM  a panel: bipartite graph. transcripts T1 T2 T3 R1 R2 R4 reads R3 Our ML Model Edges connect reads with transcripts that can emit them -> therefore each edge is weighted by the probability of the read to be emitted by the transcript

Generic EM algorithm Initialization: uniform transcript frequencies f(j)’s E-step: Compute the expected number n(j) of reads sampled from transcript j assuming current transcript frequencies f(j) M-step: For each transcript j, set f(j) = portion of reads emitted by transcript j among all reads in the sample ML estimates for f(j) = n(j)/(n(1) + . . . + n(N)) EM Algorithm can not be better than our model. CAME 2011, Atlanta, GA

ML Model Quality How well ML model explain the reads. deviation between expected (ej) and observed read frequency (oj): expected read frequency: If the deviation is high we expect that some transcripts are missing from the panel. |R| - number of reads hsi,j – weighted match based on mapping read rj - maximum-likelihood frequency of transcript ti CAME 2011, Atlanta, GA

Outline EM for Isoform Expression Estimation Virtual Transcript EM Algorithm DRUT: Detection and Reconstruction of Unannotated Transcripts Experimental Results Conclusions We enhance EM Algorithm with Virtual Transcript Model adequately describe the data CAME 2011, Atlanta, GA

Virtual Transcript Expectation Maximization (VTEM) VTEM is based on a modification of Virtual String Expectation Maximization (VSEM) Algorithm [Mangul et al. 2011]). the difference is that we consider in the panel exons instead of reads Calculate observed exon counts based on read mapping each read contribute to count of either one exon or two exons (depending if it is a unspliced spliced read or spliced read) R1 R2 R4 reads R3 transcripts T1 T2 R1 R2 R4 reads R3 exon counts - How? We can see that R1, R2, 3 and 4 were emitted by Transcript 1 and also T3 also emitts 2,3,4. but each tr. consists of several exons and tr. shares exons -> for a more accurate estimation we will consider exons Transcripts consists of exons and reads can be emitted from a single exon or from two exons if they are spliced reads. transcripts T1 T2 E1 E2 exons E3 1 3

Input: Complete vs Partial Annotations Complete Annotations Partial Annotations O 0.25 exons O 0.25 exons E1 E1 transcripts transcripts T1 T1 E2 E2 T2 T2 E3 E3 T3 E4 E4 Observed exon frequencies Transcript T3 is missing from annotations CAME 2011, Atlanta, GA

Adding Virtual Transcript Complete Annotations Partial Annotations O 0.25 exons O 0.25 exons E1 E1 transcripts transcripts T1 T1 E2 E2 T2 T2 E3 E3 VT T3 E4 Dash edges = 0 h_{VT, r_j} = 0 E4 VT Virtual transcript, with weighted probability of VT to emit exon j equals 0 (i.e., dashed edges) 11 CAME 2011, Atlanta, GA

transcript frequencies After 1st EM Run Complete Annotations Partial Annotations O E .25 exons O E .25 .32 .16 exons ML .34 .66 E1 ML .25 .5 E1 transcripts transcripts T1 T1 E2 E2 T2 T2 E3 E3 VT T3 E4 -After 1st EM run we obtain ML-estimated transcript frequencies and Expected exon frequencies E4 VT ML-estimated transcript frequencies Expected exon frequencies CAME 2011, Atlanta, GA

Updating Weights From Virtual Transcript Complete Annotations Partial Annotations O E .25 exons O E .25 .32 .16 exons ML .34 .66 E1 ML .25 .5 E1 transcripts transcripts T1 T1 E2 E2 T2 T2 E3 E3 VT T3 E4 Very frequently transcripts share same exons If O < E -> we need to decrease VT weights, but since they are already 0 we will go to next step E4 VT O > E  Increase VT weights O < E  Decrease VT weights (to 0) Observed = Expected Nothing to update CAME 2011, Atlanta, GA

After 2nd EM Run O E .25 O E .25 .3 .15 ML .32 .65 .03 ML .25 .5 Complete Annotations Partial Annotations O E .25 exons O E .25 .3 .15 exons ML .32 .65 .03 E1 ML .25 .5 E1 transcripts transcripts T1 T1 E2 E2 T2 T2 E3 E3 VT T3 E4 For Complete Annotations, since observed and expected values are the same -> VT frequency stay the same E4 VT VT frequency stays 0 VT frequency increases! Deviation of expected from observed decreases! CAME 2011, Atlanta, GA

After the Last EM Run O E .25 O E .25 ML .20 .60 ML .25 .50 Complete Annotations Partial Annotations O E .25 reads O E .25 reads ML .20 .60 E1 ML .25 .50 E1 transcripts transcripts T1 T1 E2 E2 T2 T2 E3 E3 VT T3 E4 Complete annotation validate our approach showing that if Complete Annotations exists no extra unexisting transcript is generated E4 VT VT frequency is 0 No false positives Overexpressed exons VT frequency (.2) ≈ T3 frequency (.25) VT’s exons (E3,E4)= T3’s exons (E3,E4) CAME 2011, Atlanta, GA

Virtual Transcript Expectation Maximization (VTEM) ML estimates of transcript frequencies Compute expected exons Update weights of reads in virtual transcript EM (Partially) Annotated Genome + Virtual Transcript with 0-weights in virtual transcript Virtual Transcript frequency change>ε? Output overexpressed exons (expressed by virtual transcripts) YES NO Use reads from overexpressed exons as input in Cufflinks for reconstruction of unknown transcripts. Overexpressed exons belongs to unknown transcripts CAME 2011, Atlanta, GA

Outline EM for Isoform Expression Estimation Virtual Transcript EM Algorithm DRUT: Detection and Reconstruction of Unannotated Transcripts Experimental Results Conclusions CAME 2011, Atlanta, GA

Detection and Reconstruction of Unannotated Transcripts a) Map reads to annotated transcripts (using Bowtie) b) VTEM: Identify overexpressed exons (possibly from unannotated transcripts) c) Assemble Transcripts (e.g., Cufflinks) using reads from “overexpressed” exons and unmapped reads d) Output: annotated transcripts + novel transcripts CAME 2011, Atlanta, GA

Outline EM for Isoform Expression Estimation Virtual Transcript EM Algorithm DRUT: Detection and Reconstruction of Unannotated Transcripts Experimental Results Conclusions CAME 2011, Atlanta, GA

Simulation Setup human genome data (UCSC hg18) UCSC database - 66, 803 isoforms 19, 372 genes. Single error-free reads: 60M of length 100bp for partially annotated genome -> remove from every gene exactly one isoform CAME 2011, Atlanta, GA

Distribution of isoforms length and gene cluster sizes in UCSC dataset CAME 2011, Atlanta, GA

Comparison Between Methods Sensitivity - portion of the annotated transcript sequences being captured by candidate transcript sequences CAME 2011, Atlanta, GA

Conclusions We proposed DRUT a novel annotation-guided method for transcriptome discovery and reconstruction in partially annotated genomes. DRUT overperforms existing genome-guided transcriptome assemblers (i.e., Cufflinks) DRUT shows similar or better performance with existing annotation-guided assemblers (Cufflinks - RABT). CAME 2011, Atlanta, GA

Thanks! CAME 2011, Atlanta, GA