RNA-Seq Assembly: Fundamental Limits, Algorithms and Software

RNA-Seq Assembly: Fundamental Limits, Algorithms and Software. David Tse, Stanford University. Symposium on Turbo Codes and Iterative Information Processing, Bremen, Germany, August 20, 2014. Joint work with Sreeram Kannan and Lior Pachter. Research supported by NSF Center for Science of Information.

Communication system design: 1) Establish fundamental limits. 2) Design codes and algorithms to approach the limits. 3) Implement a system. We apply this methodology to the RNA-Seq assembly problem.

DNA sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

High throughput sequencing revolution

Shotgun sequencing read

Sequencing technologies
Sanger 3730xl: dideoxy chain termination; read length 400-900 bp; error rate 0.001%
454 GS: pyrosequencing; read length 700 bp; error rate 0.1%
Ion Torrent: detection of hydrogen ion; read length ~400 bp; error rate 2%
SOLiDv4: ligation and two-base coding; read length 50 + 50 bp; error rate 0.1%
Illumina HiSeq 2000: reversible nucleotides; read length 100 bp PE; error rate 2%
Pac Bio: single molecule real time; read length 1000~10000 bp; error rate 10-15%
Output data (per run): 100 KB, 1 GB, 100 GB, 1 TB, 10 GB

High throughput sequencing: microscope in the big data era. Applications: assembly, genomic variations, 3-D structures, transcription, translation, protein interaction, etc. Today's focus: RNA sequencing.

Central dogma of molecular biology: DNA is transcribed into RNA, and RNA is translated into protein. RNA transcripts and their abundances capture the dynamic state of a cell at a given time.

From DNA to RNA. [Figure: DNA made of exons and introns; RNA Transcript 1 and RNA Transcript 2 are formed from different combinations of exons.] Alternative splicing yields different isoforms. Transcripts are 1000's to 10,000's of symbols long.

Transcriptome. [Figure: two transcripts, one with 20 copies in the cell and one with 30 copies in the cell.] Different transcripts are present at different abundances. The transcriptome is the mixture of transcripts from all the genes. The human transcriptome has 10,000's of transcripts from 20,000 genes.

RNA-Seq (Mortazavi et al, Nature Methods 08)

RNA-Seq assembly. [Figure: reads are sampled from the transcripts; the assembler reconstructs the transcriptome from the reads.]

RNA assembly: state of the art. Popular assemblers diverge significantly when fed the same input. [Figure: comparison of transcripts predicted by IsoLasso, Scripture and Cufflinks; extracted counts: 24243 7553 9741 6457 448216 59647 5588.] Source: Wei Li et al, JCB 2011; data from the ENCODE project.

Assembly as a software engineering problem: a single sequencing experiment can generate 100's of millions of reads, amounting to 10's to 100's of gigabytes of data. The primary concerns are to minimize time and memory requirements. There is no guarantee on the optimality of assembly quality, and in fact no optimality criterion at all.

A new approach: 1) Establish information-theoretic limits under simplifying assumptions. 2) Design an assembly algorithm that achieves close to the limits. 3) Build software and test it on simulated and real data.

Information-theoretic limits. Basic question: what are the length, number and error rate of the reads needed for reliable reconstruction of a transcriptome? A simplified question: what is the minimum read length L_critical needed, assuming infinitely many noiseless reads? (cf. earlier work on DNA assembly: Bresler, Bresler and T. 2013 BMC Bioinformatics, Motahari et al 2013 ISIT)

L_critical depends on repeats. L_critical is a measure of the repeat complexity of the transcriptome from the point of view of assembly.

What is L_critical for a transcriptome? L_critical depends on the intra-transcript repeats and the inter-transcript repeats in the transcriptome.

Intra-transcript repeats: interleaved repeats. [Figure: a single transcript containing interleaved repeats; labels L-1 and L.] L_critical is lower bounded by the length of the longest intra-transcript interleaved repeat.

Inter-transcript repeats: L_critical is typically much larger due to inter-transcript repeats of exons across isoforms. [Figure: isoforms sharing exons that are 100's of symbols long.]
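As a rough illustration of an inter-transcript repeat, the sketch below finds the longest substring shared by two isoforms using the textbook dynamic program. The function name and the toy sequences are hypothetical and not taken from the talk; the point is only that a shared exon shows up as a long common substring.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest substring shared by a and b (O(len(a) * len(b)) DP)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # prev[j]: length of common suffix of a[:i-1] and b[:j]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical toy isoforms that share the exon "CATGCA".
isoform_1 = "AAC" + "CATGCA" + "TTA"
isoform_2 = "GGT" + "CATGCA" + "CCG"
shared = longest_common_substring(isoform_1, isoform_2)
print(shared, len(shared))
```

On real transcriptomes a suffix-array or k-mer index would be the practical choice; the quadratic table is kept here only for readability.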

Ambiguity due to inter-transcript repeats. [Figure: transcript 1 (s1 s3 s4) and transcript 2 (s2 s3 s5) share the segment s3; labels: L-1, L = read length.]

Ambiguity due to inter-transcript repeats. [Figure: transcripts 1 to 4 built from segments s1 to s5 around the shared segment s3; segment sequences as extracted: s1 s3 s4, s2 s3 s5, s1 s3 s4, s1 s3 s5; labels: L-1, L = read length.]
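The ambiguity can be checked numerically under the idealized model of infinitely many noiseless reads, where all the reads reveal is the abundance-weighted count of every length-L substring. The sketch below is an illustration under assumptions, not the construction from the talk: `read_spectrum`, the segment strings and the swapped pairing {s1 s3 s5, s2 s3 s4} are hypothetical. Two transcriptomes that pair the flanks of a shared segment differently give identical read statistics until L exceeds the shared segment's length by at least two.

```python
from collections import Counter


def read_spectrum(transcriptome, L):
    """Abundance-weighted counts of all length-L substrings (idealized noiseless reads)."""
    spectrum = Counter()
    for seq, abundance in transcriptome.items():
        for i in range(len(seq) - L + 1):
            spectrum[seq[i:i + L]] += abundance
    return spectrum


# Hypothetical segments; s3 (length 6) is shared by both transcripts.
s1, s2, s3, s4, s5 = "AAC", "GGT", "CATGCA", "TTA", "CCG"
pairing_a = {s1 + s3 + s4: 1, s2 + s3 + s5: 1}  # equal abundances
pairing_b = {s1 + s3 + s5: 1, s2 + s3 + s4: 1}  # flanks swapped

for L in (len(s3) + 1, len(s3) + 2):
    same = read_spectrum(pairing_a, L) == read_spectrum(pairing_b, L)
    print(f"L={L}: indistinguishable={same}")
```

A read must cover the shared segment plus at least one symbol on each side to bridge it, which is why the two pairings separate only at L = len(s3) + 2 in this toy case.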

Abundance diversity. [Figure: transcript abundances in a lymphoblastoid cell line, Geuvadis dataset.]

Equal abundance vs. generic abundances. [Figure: two panels showing transcripts built from segments s1 to s5 around the shared segment s3; labels: L-1 and abundances a, b, c.] Unique generic solution, also sparse.

Unresolvable intra-transcript repeats with generic abundances yield a lower bound for L_critical for a given transcriptome. [Figure: transcripts s1 s3 s4, s2 s3 s4 and s2 s3 s5 with abundances a, b and c. Alternative solution: s1 s3 s4, s2 s3 s4 and s1 s3 s5 with abundances a-c, b+c and c.]
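The two solutions on this slide can be checked by summing abundances over each pair of adjacent segments: both give the same flow on every edge, so reads too short to bridge s3 cannot tell them apart. A minimal sketch follows; the helper `edge_flows` and the numeric values of a, b, c are hypothetical, chosen with a > c so that all abundances stay positive.

```python
from collections import defaultdict


def edge_flows(decomposition):
    """Sum path abundances over every pair of consecutive segments."""
    flows = defaultdict(int)
    for path, abundance in decomposition.items():
        for u, v in zip(path, path[1:]):
            flows[(u, v)] += abundance
    return dict(flows)


a, b, c = 5, 3, 2  # hypothetical abundances

original = {
    ("s1", "s3", "s4"): a,
    ("s2", "s3", "s4"): b,
    ("s2", "s3", "s5"): c,
}
alternative = {
    ("s1", "s3", "s4"): a - c,
    ("s2", "s3", "s4"): b + c,
    ("s1", "s3", "s5"): c,
}

print(edge_flows(original))
print(edge_flows(original) == edge_flows(alternative))  # True
```

Both decompositions use three paths, so sparsity alone does not pick one, which is consistent with the slide's claim that such a configuration forces a lower bound on the read length.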

Algorithm: reduction to sparsest flow. Create a splice graph where each node is an exon. Read copy counts give edge flows. Transcripts are extracted by solving a sparsest flow problem. [Figure: splice graph on s1 to s5 with edge flows 0.12 and 0.88, decomposed into the paths s1 s3 s4 and s2 s3 s5 with flows 0.12 and 0.88.]

The sparsest flow decomposition problem is NP-hard [Vatinlen et al '08, Hartman et al '12].
– A closer look at hard instances: most paths have the same flow.
– Equivalent to: most transcripts have the same abundance (!)
– This is not characteristic of the biological problem.
Our result:
– Assume that abundances are generic.
– Propose a provably correct algorithm that reconstructs when L > L_suff.
– The algorithm is linear time under this condition.
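To give a feel for why generic abundances help, here is a deliberately simplified sketch, not the algorithm from the talk: at a single junction node, each incoming edge is paired with the unique outgoing edge that carries the same flow. The function name, the tolerance parameter and the assignment of 0.12 and 0.88 to particular edges are assumptions made for illustration. With equal abundances the pairing is ambiguous and the sketch returns None.

```python
def match_equal_flows(in_flows, out_flows, tol=1e-9):
    """Pair in-edges with out-edges of equal flow at one node; None if ambiguous."""
    pairs = []
    unused = dict(out_flows)
    for u, f in in_flows.items():
        candidates = [w for w, g in unused.items() if abs(f - g) <= tol]
        if len(candidates) != 1:
            return None  # no match, or several equally good matches
        pairs.append((u, candidates[0], f))
        del unused[candidates[0]]
    return pairs if not unused else None


# Junction s3 with assumed edge flows 0.12 and 0.88 (generic abundances).
generic = match_equal_flows({"s1": 0.12, "s2": 0.88}, {"s4": 0.12, "s5": 0.88})
print(generic)

# Equal abundances: either pairing explains the flows, so no unique answer.
equal = match_equal_flows({"s1": 0.5, "s2": 0.5}, {"s4": 0.5, "s5": 0.5})
print(equal)
```

Real data would need noise-tolerant matching, and a path that splits across several outgoing edges is not handled; the sketch only covers the case where each transcript's flow passes through the node unchanged.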

Informational limits: summary. [Figure: read length axis L starting at 0, with L_critical of a transcriptome marked; below it no algorithm can reconstruct, above it the proposed algorithm can reconstruct in linear time.] On many reference transcriptomes, these two bounds match, establishing L_critical.

From theory to software
Transcripts as paths:
– Sparsest decomposition of edge-flow into paths
– Deals with inter-transcript repeats
Aggregate abundance estimation:
– Node-wise copy count estimates
– Smoothing copy count (CC) estimates using min-cost network flow
Multibridging to construct splice graph:
– Condensation and intra-transcript repeat resolution
– Identify and discard sequencing errors

ShannonRNA: simulated reads. Chr 15 Gencode reference transcriptome, 1700 transcripts; L = 100, 1M reads, 1% error rate. [Figure: sensitivity (fraction of transcripts recovered) and specificity (false positive rate) versus coverage depth of transcripts.]
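Sensitivity as defined on this slide, the fraction of reference transcripts recovered, can be computed as below. This is a minimal sketch that assumes a transcript counts as recovered only on an exact sequence match; the talk does not state its matching criterion, and the function name and toy sequences are hypothetical.

```python
def sensitivity(reference, assembled):
    """Fraction of reference transcripts that appear exactly in the assembly."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference transcriptome is empty")
    return len(reference & set(assembled)) / len(reference)


# Hypothetical toy data: 2 of the 4 reference transcripts are recovered.
reference = {"ACGT", "GGCA", "TTAG", "CCAT"}
assembled = {"ACGT", "TTAG", "GATT"}
print(sensitivity(reference, assembled))  # 0.5
```

Evaluations on real data usually relax exact matching, for example by requiring a minimum aligned fraction, so the threshold becomes an additional parameter.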

Performance on real reads: RNA sample from human embryonic stem cells, simultaneously sequenced using long PacBio reads and short Illumina reads. Long reads are fewer in number. Short reads: read length = 50, 20 million reads. The long-read assembly serves as a proxy for ground truth [Au et al, PNAS 2013].

ShannonRNA: real reads. [Figure: sensitivity (fraction of transcripts recovered) versus coverage depth of transcripts.] Running time: Trinity 3 hrs, ShannonRNA 5 hrs. No. transcripts: 800; ShannonRNA: 527; Trinity: 476.

Zooming in. [Figure: abundance of transcripts reconstructed, segregated by number of isoforms.]

Summary: an approach to RNA assembly design based on principles of information theory, driven by and tested on transcriptomics data. The goal is to build robust, scalable software with performance guarantees.

Acknowledgements: Sreeram Kannan (Berkeley), Lior Pachter (Berkeley), Joseph Hui (Berkeley), Kayvon Mazooji (Berkeley).
