Software for Robust Transcript Discovery and Quantification from RNA-Seq Ion Mandoiu, Alex Zelikovsky, Serghei Mangul.

Slides:



Advertisements
Similar presentations
Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone.
Advertisements

Marius Nicolae Computer Science and Engineering Department
RNA-Seq based discovery and reconstruction of unannotated transcripts
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu Viral.
 Experimental Setup  Whole brain RNA-Seq Data from Sanger Institute Mouse Genomes Project [Keane et al. 2011]  Synthetic hybrids with different levels.
IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
RNAseq.
12/04/2017 RNA seq (I) Edouard Severing.
Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.
1 Detection of nTARs in the mouse intestinal transcriptome BMC Genomics, 2011.
Peter Tsai Bioinformatics Institute, University of Auckland
RNA-seq: the future of transcriptomics ……. ?
Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)
Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data Alex Zelikovsky Department of Computer Science Georgia State University Joint work.
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
mRNA-Seq: methods and applications
Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.
RNA-Seq and RNA Structure Prediction
LECTURE 2 Splicing graphs / Annoteted transcript expression estimation.
Li and Dewey BMC Bioinformatics 2011, 12:323
Expression Analysis of RNA-seq Data
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
RNAseq analyses -- methods
Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,
Next Generation DNA Sequencing
TopHat Mi-kyoung Seo. Today’s paper..TopHat Cole Trapnell at the University of Washington's Department of Genome Sciences Steven Salzberg Center.
Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)
Computational Methods for Analysis of Single Cell RNA-Seq Data
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
The iPlant Collaborative
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Novel transcript reconstruction from ION Torrent sequencing reads and Viral Meta-genome Reconstruction from AmpliSeq Ion Torrent data University of Connecticut.
Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and.
Sahar Al Seesi and Ion Măndoiu Computer Science and Engineering
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Introduction to RNAseq
Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Adrian Caciula (GSU), Serghei Mangul (UCLA) James Lindsay, Ion.
Manuel Holtgrewe Algorithmic Bioinformatics, Department of Mathematics and Computer Science PMSB Project: RNA-Seq Read Simulation.
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.
RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.
Canadian Bioinformatics Workshops
How to design arrays with Next generation sequencing (NGS) data Lecture 2 Christopher Wheat.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Introduction to RNA-seq Joel Parker, Ph.D.. LCCC Biomedical Informatics UNCseq: Cancer genome analysis of UNC Hospital patients TCGA: Processed.
KGEM: an EM Error Correction Algorithm for NGS Amplicon-based Data Alexander Artyomenko.
Canadian Bioinformatics Workshops
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
Multi-Genome Multi- read (MGMR) progress report Main source for Background Material, slide backgrounds: Eran Halperin's Accurate Estimation of Expression.
ICCABS 2013 kGEM: An EM-based Algorithm for Local Reconstruction of Viral Quasispecies Alexander Artyomenko.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
RNA Quantitation from RNAseq Data
VCF format: variants c.f. S. Brown NYU
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
Alexander Zelikovsky Computer Science Department
Reference based assembly
Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs BMI/CS Spring 2019 Colin Dewey
Dec. 22, 2011 live call UCONN: Ion Mandoiu, Sahar Al Seesi
Sequence Analysis - RNA-Seq 2
RNA-Seq Data Analysis UND Genomics Core.
Presentation transcript:

Software for Robust Transcript Discovery and Quantification from RNA-Seq Ion Mandoiu, Alex Zelikovsky, Serghei Mangul

Outline Background Existing approaches Proposed Flow Datasets

Alternative Splicing

RNA-Seq ABCDE Make cDNA & shatter into fragments Sequence fragment ends Map reads Gene Expression (GE) ABC AC DE Isoform Discovery (ID) Isoform Expression (IE)

Existing approaches Genome-guided reconstruction – Exon identification – Genome-guided assembly Genome independent reconstruction – Genome-independent assembly Annotation-guided reconstruction – Explicitly use existing annotation during assembly

Genome-guided reconstruction (GGR) Scripture(2010) – Reports all isoforms Cufflinks(2010) – Reports a minimal set of isoforms Trapnell, M. et al MAY 2010, Guttman, M. et al MAY 2010

Genome independent reconstruction (GIR) Trinity(2011),Velvet(2008), TransABySS(2008) – de Brujin k-mer graph Efficiently construct graph from large amount of raw data Scoring algorithm to recover all plausible splice form Robustness to the noise steaming from sequencing errors Grabherr, M. et al. Nat. Biotechnol. JULY 2011

GGR vs GIR Garber, M. et al. Nat. Biotechnol. JUNE 2011

Max Set vs Min Set Garber, M. et al. Nat. Biotechnol. JUNE 2011

Reconstruction Strategies Comparison Grabherr, M. et al. Nat. Biotechnol. MAY 2011

IsoEM EM Algorithm for IE – Single and/or paired reads – Fragment length distribution – Strand information – Base quality scores Nicolae, M. et al.

IsoEM Validation on MAQC Samples RNA-Seq: 6 MAQC libraries, 47-92M 35bp reads each [Bullard et al. 10] qPCR: Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]

VSEM : Virtual String EM Estimate total frequency of missing transcripts Identify read spectrum sequenced from missing transcripts Mangul, S. et al. ML estimates of string frequencies ML estimates of string frequencies Compute expected read frequencies Compute expected read frequencies Update weights of reads in virtual string Update weights of reads in virtual string EM (Incomplete) Panel + Virtual String with 0-weights in virtual string (Incomplete) Panel + Virtual String with 0-weights in virtual string Virtual String frequency change>ε? Output string frequencies Output string frequencies EM YES NO

Proposed Flow Step 1: Read error correction Step 2: Maximum likelihood estimation of isoform frequencies and identification of unexplained reads Step 3: Read clustering Step 4: Read graph construction and candidate transcript generation. Continue Step 2

SOLiD RNA-Seq Datasets MCF7-SOLiD4 (April 2010) Paired End MCF7- SOLiD5500 (December 2010) Paired End MCF7- SOLiD5500 (December 2010) Frag Color MCF7- SOLiD5500 (December 2010) Frag ECC Base Total BAM records processed (valid records):540,187,060964,677,956447,491,122442,406,834 Total unmapped records:135,285,131249,120,11200 Total not primary records:0000 Total low mapQV(<10) records:125,776,254302,827,913116,983,995149,380,139 Not in any chromosome in the dictionary:12,483,85926,731,19418,800,6759,338,242 Total reads passing filters:266,641,816385,998,737311,706,452283,688,453 Counted on exons:202,347,590282,998,093232,539,004209,808,863 Counted on introns:32,366,42453,218,65944,321,42242,017,833 Counted intergenic:31,927,80249,781,98534,846,02631,861,757

Validation Datasets MAQC Sample : 1K transcripts – HBR (brain sample) – UHR (universal human reference)

Available Annotations NCBI UCSC Ensembl AceView Less conservative

Q/A