RNA-Seq Assembly: Fundamental Limits, Algorithms and Software. David Tse, Stanford University.

Presentation transcript:

RNA-Seq Assembly: Fundamental Limits, Algorithms and Software. David Tse, Stanford University. Symposium on Turbo Codes and Iterative Information Processing, Bremen, Germany, August 20, 2014. Joint work with Sreeram Kannan and Lior Pachter. Research supported by NSF Center for Science of Information.

Communication system design 1) Establish fundamental limits. 2) Design codes and algorithms to approach the limit. 3) Implement a system. We apply this methodology to the RNA-Seq assembly problem.

DNA sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

High throughput sequencing revolution

Shotgun sequencing read

Sequencing Technologies
Sequencer: Sanger 3730xl; 454 GS; Ion Torrent; SOLiDv4; Illumina HiSeq 2000; PacBio
Mechanism: Dideoxy chain termination; Pyrosequencing; Detection of hydrogen ion; Ligation and two-base coding; Reversible nucleotides; Single molecule real time
Read length: [lost] bp; 700 bp; ~400 bp; [lost] bp; 100 bp PE; 1000~10000 bp
Error rate: 0.001%; 0.1%; 2%; 0.1%; 2%; 10-15%
Output data (per run): 100 KB; 1 GB; 100 GB; 1 TB; 10 GB (five of six values survive)

High throughput sequencing: Microscope in the big data era. Assembly, genomic variations, 3-D structures, transcription, translation, protein interaction, etc. Today’s focus: RNA sequencing.

Central dogma of molecular biology. DNA is transcribed into RNA, and RNA is translated into protein. RNA transcripts and their abundances capture the dynamic state of a cell at a given time.

From DNA to RNA. [Figure: a DNA sequence with exons and introns, and two RNA transcripts (RNA Transcript 1, RNA Transcript 2) derived from it; sequences are 1000’s to 10,000’s of symbols long.] Alternative splicing yields different isoforms.

Transcriptome. [Figure: two transcripts, labeled 20 copies in cell and 30 copies in cell.] Different transcripts are present at different abundances. The transcriptome is the mixture of transcripts from all the genes. The human transcriptome has 10,000’s of transcripts from 20,000 genes.

RNA-Seq (Mortazavi et al, Nature Methods 08)

RNA-Seq assembly. [Figure: reads sampled from the transcripts are fed to an assembler, which reconstructs the transcriptome.]

RNA assembly: state-of-the-art. Popular assemblers diverge significantly when fed the same input. [Figure: comparison of IsoLasso, Scripture and Cufflinks. Source: Wei Li et al, JCB 2011; data from ENCODE project.]

Assembly as a software engineering problem. A single sequencing experiment can generate 100’s of millions of reads, 10’s to 100’s of gigabytes of data. Primary concerns are to minimize time and memory requirements. There is no guarantee on the optimality of assembly quality, and in fact no optimality criterion at all.

A new approach Establish information theoretic limits under simplifying assumptions. Design an assembly algorithm that achieves close to the limits. Build software and test on simulated and real data.

Information theoretic limits. Basic question: What are the length, number and error rate of the reads needed for reliable reconstruction of a transcriptome? A simplified question: What is the minimum read length L critical needed, assuming infinitely many noiseless reads? (cf. earlier work on DNA assembly: Bresler, Bresler and Tse, BMC Bioinformatics; Motahari et al, 2013 ISIT)


L critical depends on repeats. L critical is a measure of the repeat complexity of the transcriptome from the point of view of assembly.

What is L critical for a transcriptome? L critical depends on intra-transcript repeats and inter-transcript repeats in the transcriptome.

Intra-transcript repeats: interleaved repeats. [Figure: a single transcript containing interleaved repeats, with lengths L-1 and L marked.] L critical is lower bounded by the length of the longest intra-transcript interleaved repeat.

Inter-transcript repeats. L critical is typically much larger due to inter-transcript repeats of exons across isoforms. [Figure: isoforms with shared exons; a length of 100’s of symbols is marked.]
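A rough way to get a feel for inter-transcript repeats is to measure the longest substring that two isoforms share. The sketch below is only a proxy under stated assumptions: the function name and the two isoform strings are hypothetical, and the talk's L critical depends on more structure than a single pairwise repeat.

```python
def longest_shared_substring(a: str, b: str) -> str:
    """Return one longest substring common to a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # prev[j]: length of common suffix of a[:i-1], b[:j]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical isoforms that share the exon "CATTCG".
isoform_1 = "ATCCATTCGGAT"
isoform_2 = "TTACATTCGAAC"
repeat = longest_shared_substring(isoform_1, isoform_2)
print(repeat, len(repeat))
```

With real exons the shared stretch is hundreds of symbols long, which is why reads shorter than that cannot, on their own, say which upstream segment goes with which downstream segment.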

Ambiguity due to inter-transcript repeats. [Figure: transcript 1 and transcript 2 drawn as chains of segments from s1, ..., s5, both containing s3; an interval of length L-1 is marked.] L = read length.

Ambiguity due to inter-transcript repeats. [Figure: transcripts 1 to 4, each a chain of three segments from s1, ..., s5 with s3 in the middle; an interval of length L-1 is marked.] L = read length.

Abundance diversity: lymphoblastoid cell line, Geuvadis dataset.

Equal abundance vs. generic abundances. [Figure: two panels, "Equal abundance" and "Generic abundances", showing chains of segments from s1, ..., s5 with abundance labels a, b, c; an interval of length L-1 is marked.] Unique generic solution, also sparse.

Unresolvable intra-transcript repeats with generic abundances. Yields a lower bound for L critical for a given transcriptome. [Figure: three transcripts over segments s1, ..., s5 with abundances a, b, c, and an alternative solution with abundances a-c, b+c, c.]
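The ambiguity can be checked numerically. The sketch below assumes one reading of the figure: the transcripts are s1 s3 s4, s2 s3 s4 and s2 s3 s5 with abundances a, b, c, and reads are too short to bridge s3, so only counts of adjacent segment pairs are observed. The pairing of abundances to paths in the alternative solution is the one forced by flow conservation; the numeric values are hypothetical.

```python
from collections import Counter


def edge_flows(transcripts):
    """Sum abundances over consecutive segment pairs of each transcript."""
    flows = Counter()
    for path, abundance in transcripts:
        for u, v in zip(path, path[1:]):
            flows[(u, v)] += abundance
    return dict(flows)


a, b, c = 7, 5, 2  # hypothetical abundances with a > c

original = [
    (("s1", "s3", "s4"), a),
    (("s2", "s3", "s4"), b),
    (("s2", "s3", "s5"), c),
]
alternative = [
    (("s1", "s3", "s4"), a - c),
    (("s2", "s3", "s4"), b + c),
    (("s1", "s3", "s5"), c),
]

# Both transcriptomes induce identical flows on every edge.
print(edge_flows(original) == edge_flows(alternative))
```

Both solutions use three transcripts, so sparsity alone cannot choose between them; only reads long enough to span s3 can.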

Algorithm: reduction to sparsest flow. Create a splice graph where each node is an exon. Read copy counts give edge flows. Transcripts are extracted by solving a sparsest flow problem. [Figure: a splice graph over nodes s1, ..., s5 and the transcripts s1 s3 s4 and s2 s3 s5 extracted from it.]
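To make the reduction concrete, here is a minimal sketch of decomposing edge flows on a splice graph into paths. It uses a simple greedy heuristic (follow the largest remaining edge, peel off the bottleneck), which is not the algorithm from the talk and does not guarantee the sparsest decomposition. The node names "S" and "T" (an added source and sink), the function name, and the flow values are all assumptions for illustration; the graph is assumed to be acyclic.

```python
def greedy_path_decomposition(edge_flow, source, sink):
    """Repeatedly peel off a source-to-sink path carrying its bottleneck flow.

    Heuristic only: it does not guarantee the sparsest decomposition.
    """
    flow = dict(edge_flow)
    paths = []
    while True:
        # Walk from the source, always taking the largest remaining edge.
        path, node = [source], source
        while node != sink:
            out = [(f, v) for (u, v), f in flow.items() if u == node and f > 0]
            if not out:
                break
            _, node = max(out)
            path.append(node)
        if node != sink:
            break  # no remaining source-to-sink path
        edges = list(zip(path, path[1:]))
        bottleneck = min(flow[e] for e in edges)
        for e in edges:
            flow[e] -= bottleneck
        paths.append((path, bottleneck))
    return paths


# Hypothetical edge flows induced by s1 s3 s4 (30 copies) and s2 s3 s5 (20 copies).
edge_flow = {
    ("S", "s1"): 30, ("S", "s2"): 20,
    ("s1", "s3"): 30, ("s2", "s3"): 20,
    ("s3", "s4"): 30, ("s3", "s5"): 20,
    ("s4", "T"): 30, ("s5", "T"): 20,
}
paths = greedy_path_decomposition(edge_flow, "S", "T")
for path, abundance in paths:
    print(path, abundance)
```

In this toy instance the two abundances differ, so the greedy walk happens to recover both transcripts; in general a heuristic like this can pair segments incorrectly.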

Sparsest flow decomposition is NP-hard. [Vatinlen et al ’08, Hartman et al ’12]
– Closer look at hard instances: most paths have the same flow.
– Equivalent to: most transcripts have the same abundance (!)
– This is not characteristic of the biological problem.
Our result:
– Assume that abundances are generic.
– Propose a provably correct algorithm that reconstructs when L > L suff.
– Algorithm is linear time under this condition.
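A toy illustration of why equal abundances are the hard case: at a shared node, count the one-to-one pairings of incoming and outgoing edges whose flows agree exactly. This is not the proposed algorithm, only a sketch of the intuition; the function name and flow values are hypothetical, and it assumes each edge carries a single transcript.

```python
from itertools import permutations


def consistent_pairings(in_flow, out_flow):
    """List one-to-one pairings of in-edges to out-edges with exactly equal flow."""
    ins, outs = sorted(in_flow), sorted(out_flow)
    result = []
    for perm in permutations(outs):
        if all(in_flow[i] == out_flow[o] for i, o in zip(ins, perm)):
            result.append(list(zip(ins, perm)))
    return result


# Equal abundances: either upstream segment can pair with either downstream one.
equal = consistent_pairings({"s1": 10, "s2": 10}, {"s4": 10, "s5": 10})

# Generic (distinct) abundances: the flows themselves identify the pairing.
generic = consistent_pairings({"s1": 30, "s2": 20}, {"s4": 30, "s5": 20})

print(len(equal), len(generic))
```

With distinct abundances the pairing through s3 is determined by the counts alone, even though no read spans s3.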

Informational limits: summary. [Figure: read-length axis, L, starting at 0, with regions labeled "No algo. can reconstruct" and "Proposed algo. can reconstruct in linear time" around L critical.] On many reference transcriptomes, these two bounds match, establishing L critical!

From theory to software. Transcripts as paths; sparsest decomposition of edge-flow into paths; deals with inter-transcript repeats; aggregate abundance estimation; node-wise copy count (CC) estimates; smoothing CC estimates using min-cost network flow; multibridging to construct splice graph; condensation and intra-transcript repeat resolution; identify and discard sequencing errors.

ShannonRNA: simulated reads. Chr 15 Gencode reference transcriptome, 1700 transcripts; L = 100, 1M reads, 1% error rate. [Plot axes: coverage depth of transcripts; sensitivity (fraction of transcripts recovered); specificity (false positive rate).]
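The two metrics named on this slide can be computed as below. This is a sketch under assumptions: a transcript counts as recovered only on an exact string match, and the false positive rate is taken as the fraction of reported transcripts absent from the reference. The talk does not state its matching criterion, and the sequences here are made up.

```python
def sensitivity_and_false_positive_rate(reference, assembled):
    """Compare an assembly to a reference using exact sequence matches."""
    reference, assembled = set(reference), set(assembled)
    recovered = reference & assembled
    sensitivity = len(recovered) / len(reference)
    false_positive_rate = len(assembled - reference) / len(assembled)
    return sensitivity, false_positive_rate


# Hypothetical reference and assembler output.
reference = ["ATCCAT", "TCGGAT", "GATTCG", "TTCGAT"]
assembled = ["ATCCAT", "TCGGAT", "GATTCG", "AAAAAA"]
sens, fpr = sensitivity_and_false_positive_rate(reference, assembled)
print(sens, fpr)
```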

Performance on real reads. RNA sample from human embryonic stem cell, simultaneously sequenced using long PacBio reads and short Illumina reads. Long reads are fewer in number. Read length = 50, 20 million reads. Long read assembly as a proxy for ground truth. [Au et al, PNAS 2013]

ShannonRNA: real reads. Running time: Trinity: 3 hrs; ShannonRNA: 5 hrs. No. of transcripts: 800; ShannonRNA: 527; Trinity: 476. [Plot axes: coverage depth of transcripts; sensitivity (fraction of transcripts recovered).]

Zooming in: abundance of transcripts reconstructed (segregated by number of isoforms).

Summary An approach to RNA assembly design based on principles of information theory. Driven by and tested on transcriptomics data. Goal is to build robust, scalable software with performance guarantees.

Acknowledgements: Sreeram Kannan (Berkeley), Lior Pachter (Berkeley), Joseph Hui (Berkeley), Kayvon Mazooji (Berkeley).