RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes

Serghei Mangul*, Adrian Caciula*, Ion Mandoiu** and Alexander Zelikovsky*
*Georgia State University, **University of Connecticut

Experimental Results

Simulation setup:
- Human genome data (UCSC hg18)
- UCSC database: 66,803 isoforms, 19,372 genes
- Single error-free reads: 60M reads of length 100bp
- Partially annotated genome: exactly one isoform is removed from every gene

Fig. 9(a) shows that it is more difficult to correctly reconstruct all transcripts in genes with more transcripts. As a result, Cufflinks, which does not use annotations in its standard settings, performs better on genes with few transcripts. DRUT has higher sensitivity on genes with 2 and 3 transcripts, but RABT is better on genes with 4 transcripts. For genes with more than 4 transcripts, the performance of the annotation-guided methods equals the "existing annotations ratio", which means that these methods are unable to reconstruct the unannotated transcripts.

References

1. S. Mangul, I. Astrovskaya, M. Nicolae, B. Tork, I. Mandoiu, and A. Zelikovsky, "Maximum likelihood estimation of incomplete genomic spectrum from HTS data," in Proc. 11th Workshop on Algorithms in Bioinformatics.
2. C. Trapnell, B. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. van Baren, S. Salzberg, B. Wold, and L. Pachter, "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation," Nature Biotechnology, vol. 28, no. 5, pp. 511–515.

Fig. 9. a) Sensitivity and PPV of the methods grouped by the number of transcripts per gene. Here, 60M single reads of length 100bp are simulated.

* Cufflinks is a well-known tool for transcriptome reconstruction [2].
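The per-group sensitivity and PPV reported in Fig. 9 can be illustrated with a small sketch. It uses the standard definitions (sensitivity = correctly reconstructed transcripts / true transcripts, PPV = correctly reconstructed transcripts / reconstructed transcripts). The poster does not state its matching criterion, so the exact-equality test on exon tuples, the function name `sensitivity_ppv`, and the input layout are all assumptions made for illustration.

```python
from collections import defaultdict


def sensitivity_ppv(true_by_gene, predicted_by_gene):
    """Return {transcripts_per_gene: (sensitivity, ppv)}.

    Both arguments map a gene id to a set of transcripts. Each transcript is
    any hashable value, e.g. a tuple of (start, end) exon intervals.
    Assumption: a prediction is correct only if it equals a true transcript.
    Predictions for genes absent from true_by_gene are ignored.
    """
    correct = defaultdict(int)   # correctly reconstructed transcripts
    n_true = defaultdict(int)    # transcripts in the ground truth
    n_pred = defaultdict(int)    # transcripts reported by the method
    for gene, truth in true_by_gene.items():
        if not truth:
            continue  # a gene without transcripts belongs to no group
        predicted = predicted_by_gene.get(gene, set())
        k = len(truth)  # group key: number of transcripts in this gene
        correct[k] += len(truth & predicted)
        n_true[k] += len(truth)
        n_pred[k] += len(predicted)

    result = {}
    for k in sorted(n_true):
        sensitivity = correct[k] / n_true[k]
        ppv = correct[k] / n_pred[k] if n_pred[k] else 0.0
        result[k] = (sensitivity, ppv)
    return result
```

Grouping by the true number of transcripts per gene mirrors the x-axis described in the Fig. 9 caption; a looser matching rule (for example, matching internal splice junctions only) would change the counts but not the structure of the computation.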
Introduction

Genes, Exons, Introns, and Splicing

- Gene: a segment of DNA or RNA that carries genetic information.
- Exon: a region of a gene which is translated into protein.
- Intron: a region of a gene which is not translated into protein.
- Splicing: a process in which the introns are removed and the exons are joined to be translated into a single protein.

Fig. 1. Chromosome with its DNA.

Alternative Splicing

Alternative splicing is the process in which exons can be spliced out in different combinations, named transcripts, to generate the mature RNA. It is a common mode of gene regulation within cells, being used by 90–95% of human genes. It can drastically alter the function of a gene in different tissue types or environmental conditions, or even inactivate the gene completely. Alternative splicing is implicated in many diseases.

Fig. 2. Alternative Splicing Process.

Discovery and Reconstruction of Unannotated Transcripts (DRUT)

GIVEN: A set of transcripts and frequencies for the reads.
FIND: Transcripts missing from the set.

a) Map reads to annotated transcripts (using Bowtie)
b) VTEM: Identify "overexpressed" exons (possibly from unannotated transcripts)
c) Assemble transcripts (e.g., with Cufflinks) using reads from "overexpressed" exons and unmapped reads
d) Output: annotated transcripts + novel transcripts

Virtual Transcript Expectation Maximization (VTEM)

Fig. 7. VTEM.
Fig. 8. An example of VTEM estimation.

Maximum Likelihood (ML) Model

The input data of EM is a panel, a bipartite graph consisting of:
- a set of candidate transcripts that are believed to emit the set of reads
- a weighted match based on the mapping of read i to transcript j (h_{Tj,i})

Fig. 3. Panel: a bipartite graph consisting of transcripts with unknown frequencies and reads with observed frequencies (o_j).

Fig. 4. Transcripts – Exons – Reads relation.
- LEFT: transcripts -> unknown frequencies
- RIGHT: reads -> observed frequencies
- EDGES: weights ~ probability of the read to be emitted by the transcript

ML Problem:
GIVEN: Annotations (transcripts) and frequencies of the reads.
FIND: ML estimate of transcript frequencies.

SUBPROBLEMS:
- Decide if the panel is likely to be incomplete
- Estimate the total frequency of missing transcripts
- Identify the read spectrum emitted by missing transcripts
- Assemble missing transcripts from the read spectrum emitted by missing transcripts

ML Estimates of Transcript Frequencies

- The probability that a read is sampled from transcript j is proportional to f(j), the (unknown) frequency of transcript j.
- The ML estimate of f(j) is given by n(j)/(n(1) + ... + n(N)), where n(j) denotes the number of reads sampled from transcript j.

Expectation Maximization (EM)

INITIALIZATION: Uniform transcript frequencies f(j).
E STEP: Compute the expected number n(j) of reads sampled from transcript j (assuming the current transcript frequencies f(j)).
M STEP: For each transcript j, set f(j) = portion of reads emitted by transcript j among all reads in the sample.

Quality of ML Model

The possible gaps in the ML model include:
- erroneous reads caused by genotyping errors
- missing and/or chimeric candidate transcripts
- an inaccurate read-to-transcript match (caused by genotyping errors)
- non-uniform emission of reads by transcripts

The quality of the ML model is measured by the deviation D of the observed read frequencies (o_j) from the expected read frequencies (e_j). Here |R| is the number of reads. The expected read frequencies (e_j) are calculated from:
- the weighted match between reads and candidate transcripts
- the maximum likelihood frequency estimates of the transcripts
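The EM procedure (initialization, E step, M step) and the deviation check described in this transcript can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the match weights are normalized per transcript into emission probabilities, transcript lengths are ignored, and D is taken to be the mean absolute difference between observed and expected read frequencies, since the poster's formula for D did not survive extraction. All function names are hypothetical.

```python
def normalize_rows(h):
    """Assumption: turn match weights h[j][i] (transcript j, read i) into
    per-transcript emission probabilities. Each row must have a positive sum."""
    return [[w / sum(row) for w in row] for row in h]


def em_frequencies(h, o, iterations=200):
    """Estimate transcript frequencies f(j) from match weights h and
    observed read counts o[i]."""
    p = normalize_rows(h)
    n_transcripts, n_reads = len(p), len(o)
    # INITIALIZATION: uniform transcript frequencies
    f = [1.0 / n_transcripts] * n_transcripts
    for _ in range(iterations):
        # E STEP: expected number n(j) of reads sampled from transcript j
        n = [0.0] * n_transcripts
        for i in range(n_reads):
            denom = sum(f[j] * p[j][i] for j in range(n_transcripts))
            if denom == 0.0:
                continue  # read not explained by any candidate transcript
            for j in range(n_transcripts):
                n[j] += o[i] * f[j] * p[j][i] / denom
        # M STEP: f(j) = portion of reads emitted by transcript j
        total = sum(n)
        f = [x / total for x in n]
    return f


def expected_read_frequencies(h, f):
    """e_i = sum over transcripts j of f(j) * P(read i | transcript j)."""
    p = normalize_rows(h)
    return [sum(f[j] * p[j][i] for j in range(len(p)))
            for i in range(len(p[0]))]


def deviation(o, e):
    """Assumed form of D: mean absolute difference between observed and
    expected read frequencies, averaged over the reads."""
    total = float(sum(o))
    return sum(abs(o_i / total - e_i) for o_i, e_i in zip(o, e)) / len(o)
```

For example, with `h = [[1, 1, 0], [0, 1, 1]]` and `o = [25, 50, 25]`, the estimate stays at the uniform frequencies and the deviation is zero, which is what a complete panel should produce. A large deviation would instead hint that the panel is missing a transcript, which is the situation DRUT is designed to detect.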