Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone.

Slides:



Advertisements
Similar presentations
RNA-Seq as a Discovery Tool
Advertisements

RNA-seq library prep introduction
Functional Genomics with Next-Generation Sequencing
The Past, Present, and Future of DNA Sequencing
EAnnot: A genome annotation tool using experimental evidence Aniko Sabo & Li Ding Genome Sequencing Center Washington University, St. Louis.
RNAseq.
Walk-thru of CAGE exercise Also at /tag_analysis/ /tag_analysis/
Transcriptome Sequencing with Reference
Peter Tsai Bioinformatics Institute, University of Auckland
Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data Alex Zelikovsky Department of Computer Science Georgia State University Joint work.
1 Computational Molecular Biology MPI for Molecular Genetics DNA sequence analysis Gene prediction Gene prediction methods Gene indices Mapping cDNA on.
RNA-Seq An alternative to microarray. Steps Grow cells or isolate tissue (brain, liver, muscle) Isolate total RNA Isolate mRNA from total RNA (poly.
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
Tyson A. Clark, Ph.D. February 11, 2015
Tutorial 7 Genome browser. Free, open source, on-line broswer for genomes Contains ~100 genomes, from nematodes to human. Many tools that can be used.
CSE182-L12 Gene Finding.
RNA-Seq An alternative to microarray. Steps Grow cells or isolate tissue (brain, liver, muscle) Isolate total RNA Isolate mRNA from total RNA (poly.
Bioinformatics Alternative splicing Multiple isoforms Exonic Splicing Enhancers (ESE) and Silencers (ESS) SpliceNest Lecture 13.
Chris Chander, Luke Adea BioSci D145 Feb. 12, 2015
Next generation sequencing Why? What? How? Marcel Dinger Developmental Biology Divisional Seminar 7 October 2010.
mRNA-Seq: methods and applications
Software for Robust Transcript Discovery and Quantification from RNA-Seq Ion Mandoiu, Alex Zelikovsky, Serghei Mangul.
Delon Toh. Pitfalls of 2 nd Gen Amplification of cDNA – Artifacts – Biased coverage Short reads – Medium ~100bp for Illumina – 700bp for 454.
Doug Brutlag 2011 Genome Databases Doug Brutlag Professor Emeritus of Biochemistry & Medicine Stanford University School of Medicine Genomics, Bioinformatics.
Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy) Genome Databases Computational Molecular Biology Biochem 218 – BioMedical Informatics.
Doug Brutlag 2011 Next Generation Sequencing and Human Genome Databases Doug Brutlag Professor Emeritus of Biochemistry & Medicine Stanford University.
Li and Dewey BMC Bioinformatics 2011, 12:323
The Ensembl Gene set The “Genebuild” 21 April 2008.
Ji-hye Choi August Introduction (2006) ABRF-NGS (the Association fo Biomolecular Resource Facilities next-generation sequencing study)
Todd J. Treangen, Steven L. Salzberg
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,
Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)
Understanding genes using mathematical tools Adam Sartiel COMPUGEN.
The iPlant Collaborative
RNA Sequencing I: De novo RNAseq
RNA surveillance and degradation: the Yin Yang of RNA RNA Pol II AAAAAAAAAAA AAA production destruction RNA Ribosome.
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Genomics.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
Accessing and visualizing genomics data
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.
RNA Sequencing and transcriptome reconstruction Manfred G. Grabherr.
Canadian Bioinformatics Workshops
RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
Approach Extract all annotated exons (Refseq and KnownGene) in region plus buffer (200KB up & down stream. Extract all conserved elements with conservation.
The Transcriptional Landscape of the Mammalian Genome
Dr. Christoph W. Sensen und Dr. Jung Soh Trieste Course 2017
Gene expression from RNA-Seq
Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite.
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
Transcriptome Assembly
Do You Want to Build a Transcriptome?
From: TopHat: discovering splice junctions with RNA-Seq
Introduction to Bioinformatics II
Next Generation Sequencing and Human Genome Databases
Volume 84, Issue 3, Pages (February 1996)
RNA sequencing (RNA-Seq) and its application in ovarian cancer
Dec. 22, 2011 live call UCONN: Ion Mandoiu, Sahar Al Seesi
Transcript length distribution resulting from different assemblies of the embryo samples across the three technologies (HiSeq, MiSeq, and PacBio). Transcript.
Mapping rates of different transcript sets to the P
Introduction to Alternative Splicing and my research report
Universal Alternative Splicing of Noncoding Exons
Sequence Analysis - RNA-Seq 2
Schematic representation of a transcriptomic evaluation approach.
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Volume 11, Issue 7, Pages (May 2015)
Presentation transcript:

Capturing the chicken transcriptome with PacBio long read RNA-seq data OR Chicken in awesome sauce: a recipe for new transcript identification Gladstone Institutes Pacific Biosciences Sean Thomas and Alisha Holloway

Overarching goal: understand gene regulation during heart development and why children are born with congenital heart defects Accelerate discovery to clinical practice by fostering collaborations of basic, translational and clinical researchers Motivation

Chicken hearts are being used as models of cardiac development chickenhuman

Motivation Functional genomics studies of the molecular mechanisms behind cardiac development require solid genome and transcript annotations.

Motivation Poor annotations are common for many model organisms that could be useful for understanding heart development and evolution

Motivation Koshiba-Takeuchi et al. 2009, Nature Turtle Chicken Tbx5 expression

Motivation Current best chicken annotations, as of 2012*: Ensembl and refSeq refSeq annotation contained only 6,459 transcripts, but were well- polished Ensembl annotation contained ~20k transcripts but with many errors mouse and chicken have similar genome sizes and numbers of genes, but Ensembl annotation for mouse has ~95k transcripts *galGal3 assembly

Motivation available annotations were unreliable

Motivation available annotations were unreliable conservation Ensembl annotation non-chicken refSeq RNA-seq data Znf503, a likely regulator of heart development

Acquired deep short read data from many tissue types Illumina data – 150 million uniquely mapping fragments Tissues –brain, cerebellum, heart, kidney, liver, testicle Acquired EST data from existing databases Employed de novo transcriptome assembly tools to generate annotation Trinity MAKER Solution?De novo transcriptome assembly of short read data and EST data

Blue boxes are exons Black lines show exons joined by: 1. exon spanning reads 2. paired-end reads Solution? Assembly of exons is possible with short reads but assembling isoforms is trickier…

Solution?

? Three exons can be joined by: 1. one end of a pair mapping to exon 1 2. other end spanning exons 2 & bp Can’t join because exon 3 longer than insert size

Assembly of exons is possible with short reads but assembling isoforms is trickier… Assembly of Illumina reads yielded 120k distinct contigs with average length of ~600bp, well below median transcript length Solution? Median Transcript Length Chicken – 1.2kb Mouse – 1.5kb

isolate mRNA generate cDNA size-select 0-1kb 1-2 kb 2-3 kb 3+ kb create libraries sequence post-processing ID full length reads trim primers & polyA tails sequencing embryonic chicken hearts error correction alignment gene modeling long error-prone read short high- fidelity reads xxxxx long corrected read galGal4 genome methodology

Terminology 1,508,184 subreads mapped uniquely (~72%) to galGal4 assembly PacBio SMRTbell transcription cDNA insert full length subread pacBio read

All Subreads – includes incompletely sequenced transcript HQRegion Subread – completely sequenced >= 1x HQRegion Full-Pass Subreads – completely sequenced >= 2x Most reads cover full length of transcript

Most reads begin at the 5’ end of transcripts and end at the 3’ end

transcript length Coverage of refSeq transcripts

Example of data…good existing annotation

Illumina Ensembl RefSeq PacBio data Example of current coverage…

New ensembl annotation based on galGal4 fixed many of the issues that motivated our efforts New genome assembly, new annotation

Remember this gene? available annotations were unreliable conservation ensembl annotation non-chicken refSeq RNA-seq data Znf503, a likely regulator of heart development

galGal3 galGal4 ensembl non-chicken refSeq Annotation improvement non-chicken refSeq Znf503, a likely regulator of heart development

New ensembl annotation based on galGal4 fixed many of the issues that motivated our efforts However, the PacBio data contains ~2,000 transcripts that represent improvements to even this newest annotation New genome assembly, new annotation

Categories of annotation improvements… B2B PacBio isoforms Ensembl 2013 annotation RefSeq annotation Illumina tag density corrected genes missing from Ensembl

Categories of annotation improvements… B2B PacBio isoforms Ensembl 2013 annotation RefSeq annotation Illumina tag density corrected exons missing from Ensembl

Categories of annotation improvements… B2B PacBio isoforms Ensembl 2013 annotation RefSeq annotation Illumina tag density identify completely new isoforms

Categories of annotation improvements… B2B PacBio isoforms Ensembl 2013 annotation RefSeq annotation Illumina tag density corrected false exons

Categories of annotation improvements… B2B PacBio isoforms Ensembl 2013 annotation RefSeq annotation Illumina tag density identify new transcription start sites

Categories of annotation improvements… B2B PacBio isoforms Illumina tag density identify new low-abundance genes/exons

Mapped ends of PacBio reads (GMAP) exhibit systematic splice donor site errors

conservation Illumina tag density Ensembl PacBio raw GMAP alignments (1,2) (3) Peculiar buildup at 3’ end of reads…

1. short reads aligned to long read long error-prone read x short high-fidelity reads long corrected read 2. consensus of aligned reads corrects error Error correction wasn’t really useful in this case (good underlying genome build)

1.New Ensembl annotation fixed many problematic transcripts 2.PacBio data added another 2,000 transcripts to the set expressed in embryonic chicken hearts Recommendations for others with similar projects 1.Select mRNAs with mature 5’cap and poly-A tail to ensure full length transcript 2.Perform normalization using double stranded nuclease to get greater coverage 3.Don’t worry about error correction if you’ve got a good reference genome 4.Be aware of some of the systematic errors associated with mapping results Summary and recommendations

Acknowledgements Jason Underwood Elizabeth Tseng Luke Hickey Alisha Holloway Sean Thomas