CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.

Slides:



Advertisements
Similar presentations
RNA-seq library prep introduction
Advertisements

The Past, Present, and Future of DNA Sequencing
EAnnot: A genome annotation tool using experimental evidence Aniko Sabo & Li Ding Genome Sequencing Center Washington University, St. Louis.
RNAseq.
Click to edit Master title style Irys data analysis January 10 th, 2014.
Transcriptome Sequencing with Reference
Peter Tsai Bioinformatics Institute, University of Auckland
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
Variant discovery Different approaches: With or without a reference? With a reference – Limiting factors are CPU time and memory required – Crossbow –
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
Tyson A. Clark, Ph.D. February 11, 2015
CSE182-L12 Gene Finding.
Henrik Lantz - BILS/SciLife/Uppsala University
High Throughput Sequencing
mRNA-Seq: methods and applications
Titus Brown Qingpeng Zhang John Blischak Welcome!.
De-novo Assembly Day 4.
LECTURE 2 Splicing graphs / Annoteted transcript expression estimation.
Li and Dewey BMC Bioinformatics 2011, 12:323
Expression Analysis of RNA-seq Data
CS 394C March 19, 2012 Tandy Warnow.
Todd J. Treangen, Steven L. Salzberg
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
June 11, 2013 Intro to Bioinformatics – Assembling a Transcriptome Tom Doak Carrie Ganote National Center for Genome Analysis Support.
Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Next Generation Sequencing and its data analysis challenges Background Alignment and Assembly Applications Genome Epigenome Transcriptome.
Next Generation DNA Sequencing
Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)
Advancing Science with DNA Sequence Metagenome definitions: a refresher course Natalia Ivanova MGM Workshop September 12, 2012.
The iPlant Collaborative
RNA Sequencing I: De novo RNAseq
RNA surveillance and degradation: the Yin Yang of RNA RNA Pol II AAAAAAAAAAA AAA production destruction RNA Ribosome.
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
De Novo Genome Assembly - Introduction Henrik Lantz - BILS/SciLife/Uppsala University.
How will new sequencing technologies enable the HMP? Elaine Mardis, Ph.D. Associate Professor of Genetics Co-Director, Genome Sequencing Center Washington.
Genomics I: The Transcriptome RNA Expression Analysis Determining genomewide RNA expression levels.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Introduction to RNAseq
billion-piece genome puzzle
De novo assembly validation
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
The iPlant Collaborative
No reference available
Sequencing technologies and Velvet assembly Lecturer : Du Shengyang September 29 , 2012.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.
RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
Canadian Bioinformatics Workshops
RNA Seq Analysis Aaron Odell June 17 th Mapping Strategy A few questions you’ll want to ask about your data… - What organism is the data from? -
Library QA & QC Day 1, Video 3
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
RNA Quantitation from RNAseq Data
Canadian Bioinformatics Workshops
Primer design.
Kallisto: near-optimal RNA seq quantification tool
Misleading Bioinformatics Mistakes, Biases, Mis-Interpretations and how to avoid them Festival of Genomics 2017.
RNA sequencing (RNA-Seq) and its application in ovarian cancer
Maximize read usage through mapping strategies
Schematic representation of a transcriptomic evaluation approach.
Sequence Analysis - RNA-Seq 1
Presentation transcript:

CyVerse Workshop Transcriptome Assembly

Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly Characterize Transcript abundance Visualization We will focus mostly on the transcriptome assembly itself as this is the greatest challenge to working without a reference genome and involves several steps.

Reminder: you’ll need to do some thinking and reading here Transcriptome assembly Good read data Assemble Transcriptome On to mapping The most costly (time, compute) process is empirically determining what parameters yield the best assembly given the data and organism. We can suggest some good directions, but we can’t promise “right” answers Evaluate and refine Get familiar with examples of work and reasoning on approaches relevant to your organism

Challenges (just a few of them) Transcriptome assembly from RNA-Seq DNA assembly assumes even sequencing depth, not true in RNA-Seq (e.g. repetitive regions = more reads for DNA, more expression for RNA), also higher coverage is needed for de novo assembly from RNA-Seq (30X or more) Sequencing error correction, esp. in highly expressed transcripts Multiple transcript variants (splicing) can confound assembly

Some reminders Transcriptome assembly from RNA-Seq Practically the only agreement on which software is the right one to use is that there is no “right” one (we’ll need to experiment*) Don't: use less than 200 to 500 Million RNA reads, mate-paired, of 100 bp or better length, high quality, and expect to get a complete transcriptome. (1) Despite the challenges, making your own transcriptome is still very very useful and perhaps more practical than assembling your own genome. 1. How to get Best mRNA Transcript assemblies. by Don Gilbert, 2013 Jan

Overview Transcriptome assembly from RNA-Seq Pre-analysis: Data generation

Generate the best data you can! Experimental Consideration – Library prep Remove ribosomal RNAs Two options with some pros/cons Poly(A) selection – effective, but will miss ncRNAs and non polyadenlyated transcripts rRNA depletion – hybridize rRNAs and remove, but may introduce different biases (e.g. against highly expressed transcripts)

Generate the best data you can! Experimental Consideration – Library prep PCR Amplification Most protocols have a PCR amplification step – this of course introduces bias (e.g. against high GC content). Some alternative protocols or technologies (PacBio) can avoid amplification but again have their own issues.

Generate the best data you can! Experimental Consideration – Library prep Strand specificity If possible, doing a strand-specific protocol can simplify future analyses

Generate the best data you can! Experimental Consideration – Library prep nscriptome-analysis-with-strand-specific-libraries No orientation Strandedness preserved

Overview SOAPdenovo-Trans Analysis

Overview SOAPdenovo-Trans SOAPdenovo-Trans is a de novo transcriptome assembler Sensitive to alternative splicing and different expression level among transcripts. Construct full-length transcript sets from RNA-Seq read data

Some comparisons Why SOAPdenovo-Trans? *Runs more quickly (easier to refine parameters) Less memory demands Good quality (Software changes rapidly, so “clear winners” will always change, you can too when the time comes)

Overview SOAPdenovo-Trans De Brujin graphs are constructed Error correction Contigs are constructed and single/paired reads are mapped to contigs to make scaffold graphs Transcripts are created from scaffold graphs Figure from: SOAPdenovo-Trans: De novo transcriptome assembly with shortf RNA-Seq reads Yinlong Xie1,2,3,†, Gengxiong Wu1,†, et.al.1BGI-Shenzhen, Shenzhen, China.

Kmers and De Brujin graphs SOAPdenovo-Trans Next-generation transcriptome assembly, Jeffrey A. Martin and Zhong Wang – Nat.Reviw.Gen doi: /nrg3068 Published online 7 September 2011 Reads split into k- mers De Brujin graph constructed from kmers

Kmers and De Brujin graphs SOAPdenovo-Trans Next-generation transcriptome assembly, Jeffrey A. Martin and Zhong Wang – Nat.Reviw.Gen doi: /nrg3068 Published online 7 September 2011 Redundancies are collapsed Paths through the graph that explained the observed sequence generate the alignments

Choosing kmers SOAPdenovo-Trans De novo assembly and analysis of RNA-seq data. Nature Methods 7, 909–912 (2010) doi: /nmeth.1517 Transcripts with lower read depths were represented better with lower K values Transcripts with higher read depth represented better with higher K

Kmer rules of thumb SOAPdenovo-Trans Larger K – more specific (better for heterozygous genomes) Smaller K – more sensitive (better for repetitive genomes) Use a range of sizes (31,41,51, etc.) and merge the results Recommendations 1/2 or 2/3 read length can be found in various places online. Remember to use an odd number

Key Results SOAPdenovo-Trans.scafSeq – Assembled transcript sets.scafStatistics – Collection of statistics describing the population of transcript sequences in the assembly.contig - The assembled contigs used to create the scaffolds.agp - Describes the assembly of contigs from reads and scaffolds from contigs. May be used with other programs to visualize the assembly in detail

Where to go from here SOAPdenovo-Trans Examine the quality of the assembly o N50 statistic o Core gene representation

N50 Assembly quality

How good is assembly coverage? CEGMA CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes Bioinformatics (2007) 23 (9): doi: /bioinformatics/btm071 First published online: March 1, 2007

Asian honeybee transcriptome Sample Data

Detailed instructions with videos, manuals, documentation in Keep asking: ask.iplantcollabortive.org