How to design arrays with Next generation sequencing (NGS) data Lecture 2 Christopher Wheat.

Outline
- Transcriptome sequencing
- Assembly
- Assessing assembly
- Annotation
- Calling SNPs
- Designing probes
- Making decisions about experimental design

[Figure: 48k contigs; gene annotation; ≈10,000 genes in triplicate. Vera and Wheat et al.]

Getting the genes
- Cheapest method is to directly sequence them
- Sequence the transcriptome
- Challenges
  - Getting the right tissue, timing, induction, etc.
  - Getting the population variation (SNPs, indels, etc.)
  - Getting high-quality RNA
  - Choosing a sequencing method
  - Assembling the data and assessing it
  - Annotating the data

Pool? Yes! Normalize? Maybe ...

- 8-day-old aerial tissue, A. thaliana seedlings
- Run 1 touched 17,449 gene models (60% of genes)
- Run 2 only touched 10% more
- Microarray studies indicate 55-67% of genes are expressed in this tissue
- They estimate they have 90% of the transcriptome in the tissue
Weber et al. 2007

Fundamental tradeoffs in reads: length vs. depth vs. cost

Roche 454 vs. Illumina
- Length: 400 bp vs. 2 x 100 bp
- Depth: 1.2 E6 vs. 300 E6 reads
- Costs: 10,000 Euros vs. ... Euros
- Long but shallow vs. short but deep

Roche 454
Stats per run:
- ... bp
- 1.2 E6 reads
- 500 MBp
- 0.5 days
- 10,000 euro?

Flow diagram: TCAGCGTAAGG GGGG
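As a rough illustration of what a flow diagram records, the sketch below turns a read into idealized flow signals: nucleotides are flowed in a fixed cycle, and each flow reports the length of the homopolymer run it extends. The cyclic flow order TACG and the function name `flowgram` are assumptions made for illustration. Real signals are noisy real numbers rather than clean integers, which is why long homopolymer runs are hard to call on this platform.

```python
def flowgram(read, flow_order="TACG"):
    """Return idealized (base, run_length) signals, one per nucleotide flow.

    flow_order is an assumed cyclic order of nucleotide flows.
    """
    read = read.upper()
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")

    signals = []
    position = 0  # index of the next unsequenced base in the read
    flow = 0      # index of the current nucleotide flow
    while position < len(read):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A flow incorporates every consecutive copy of its base at once.
        while position < len(read) and read[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
        flow += 1
    return signals


# A homopolymer run appears as a single flow with a large signal.
for base, run in flowgram("TCAGGGGG"):
    print(base, run)
```

The single G flow at the end carries the whole run, so distinguishing five Gs from four or six depends on signal intensity alone.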

Huse et al. 2007

Illumina (figure: Illumina, Inc.)

Illumina
Stats:
- 2 x 100 bp
- ... E6 reads
- ... GBp
- 9.5 days
- 3,000 euros?
(figure: Illumina, Inc.)

- Dephasing limits read length
- No homopolymer run issues
  - due to the different sequencing-by-synthesis method
- Per-read error rate
  - current estimate is very low
- Correction methods
  - quality scores and bioinformatics

Which to use? Illumina PE, because so much more data is generated per euro, giving good transcriptome coverage and thus assembly of even lowly expressed genes or rare isoforms (do your own price comparisons).
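A price comparison of this kind can be sketched in a few lines. The figures below are the per-run numbers quoted on the slides (both costs carry question marks there), and treating the 300 E6 Illumina reads as individual 100 bp reads is an assumption. The class and field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class SequencingRun:
    platform: str
    reads: float          # reads per run
    read_length_bp: int   # bases per read
    cost_eur: float       # approximate cost per run (uncertain on the slides)

    @property
    def yield_mbp(self) -> float:
        """Total sequence per run in megabases."""
        return self.reads * self.read_length_bp / 1e6

    @property
    def eur_per_mbp(self) -> float:
        """Cost per megabase of raw sequence."""
        return self.cost_eur / self.yield_mbp


runs = [
    SequencingRun("Roche 454", 1.2e6, 400, 10_000),
    # Assumption: 300 E6 counts single 100 bp reads, not read pairs.
    SequencingRun("Illumina PE", 300e6, 100, 3_000),
]

for run in runs:
    print(f"{run.platform}: {run.yield_mbp:,.0f} Mbp, {run.eur_per_mbp:.2f} EUR/Mbp")
```

With these inputs the 454 run comes to 480 Mbp, in line with the roughly 500 MBp quoted per run, and the cost per megabase differs by about two orders of magnitude. Substitute current quotes before drawing conclusions.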

Challenge: Bioinformatics
- Assembly
  - Transcriptome (all the above issues)
  - SNPs, indels, CNV, repeated elements, error
  - Fragmented assembly is the norm
  - Alternative splicing
- Software
  - Trinity, Oases, Trans-ABySS, SeqMan, CAP3, Mira2, Newbler, CLC, etc.
- Settings
  - Many methods, few studies comparing their performance
  - But see Kumar and Blaxter, and the Trinity paper.
- Computational power (beyond HD space)
  - CPU vs. RAM: tends to be RAM intensive

Learn bioinformatics, hire a bioinformatician, buy expensive software ... it all comes down to time and money. But there is also no "perfect" way to do something, as each species appears to be a bit different, so comparing different methods is the best route. CLC is a very nice, accessible commercial package, but like all things, it requires a fast computer.

BLAST against what?
- Important to determine a genomic reference species
  - Predicted gene models for comparison
  - Need a species with a predicted gene set, ideally < 100 million years divergent
  - Many genes should be shared
- Even divergent species are useful for assessing assembly runs (method × parameters)
  - Compare results

Predicted genes:
- D. melanogaster = 13,379
- B. mori = 18,510
Estimated coverage:
- 70% (D. mel estimate)
- 50% (B. mori estimate)
But how much of each gene does each contig assemble? How much fragmentation?

But what do these numbers mean?
- 45,000 contigs had BLAST hits to 9,000 gene models in another species
- What are these gene models? Are isoforms included?
- Filtering the predicted gene set to remove isoforms and recent duplicates helps greatly
- The RBB90 dataset is useful.
CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences
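To see where numbers like "45,000 contigs hit 9,000 gene models" come from, the sketch below counts distinct contigs and distinct gene models in BLAST tabular output. It assumes the standard 12-column tabular layout (query id, subject id, ..., e-value in column 11, bit score in column 12); the e-value cutoff of 1e-5 and the function names are illustrative assumptions, not values from the lecture.

```python
def summarize_blast_hits(lines, max_evalue=1e-5):
    """Collect contigs with a hit and the gene models they hit.

    lines: iterable of tab-separated BLAST tabular records.
    max_evalue: assumed significance cutoff; tune for your data.
    """
    contigs_with_hit = set()
    gene_models_hit = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        contig_id, gene_id = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue <= max_evalue:
            contigs_with_hit.add(contig_id)
            gene_models_hit.add(gene_id)
    return contigs_with_hit, gene_models_hit


def fraction_of_gene_set(gene_models_hit, n_predicted_genes):
    """Share of the reference species' predicted gene set that was hit."""
    return len(gene_models_hit) / n_predicted_genes


# Made-up records: two contigs hit geneA, one weak hit to geneB is filtered out.
example = [
    "contig1\tgeneA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "contig2\tgeneA\t80.0\t150\t30\t0\t1\t150\t250\t400\t1e-30\t200",
    "contig3\tgeneB\t70.0\t60\t18\t0\t1\t60\t10\t70\t0.5\t30",
]
contigs, genes = summarize_blast_hits(example)
print(len(contigs), "contigs hit", len(genes), "gene model(s)")
```

Several contigs hitting one gene model is the signature of fragmentation, and the denominator passed to `fraction_of_gene_set` changes the answer considerably depending on whether isoforms and recent duplicates were filtered from the predicted gene set.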

Potential BLAST bias source
Wheat, C. W. (2012) SNP Discovery in Non-model Organisms Using 454 Next Generation Sequencing.

Metabolic map comparison: Bombyx mori with WGS vs. M. cinxia with 454 sequencing

Upper estimate of 70% = 13,142 genes Wheat 2008

Assessing de novo transcriptome assembly (Vera & Wheat et al., Mol. Ecol.)
[Figure: nearest WGS vs. focal species 454; categories > 1, = 1, < 1]

Hornett & Wheat et al. 2012

Relative ortholog coverage
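One way to put a number on how much of a reference ortholog an assembly recovers is to merge the alignment intervals that contigs produce on the reference gene and divide the covered length by the gene length. This is a minimal sketch under that assumption, not necessarily the metric used in the cited papers; the function name and the 1-based inclusive coordinates (as BLAST reports them) are illustrative choices.

```python
def covered_fraction(intervals, gene_length):
    """Fraction of a reference gene covered by at least one contig alignment.

    intervals: (start, end) pairs on the reference gene, 1-based inclusive;
               reversed pairs (end < start) are accepted.
    gene_length: length of the reference gene in the same units.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted((min(a, b), max(a, b)) for a, b in intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / gene_length


# Three contigs on a 500-residue gene: two overlap, one is separate.
print(covered_fraction([(1, 100), (50, 150), (301, 400)], 500))
```

Merging before summing matters because overlapping contigs from the same gene would otherwise be counted twice and could push the apparent coverage above 1.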

Example: 6 species assemblies with BLAST result insights
- 454 EST libraries
- 22 genes assessed for sequence coverage

Alternative splicing
- > 80% in humans
- > 40% in fruit flies
- Most assemblers
  - Designed for genomic data
  - Don't know how to handle splicing
  - But Trinity can!

Transcriptome assembly: alternative splicing example (Vera and Wheat et al.)
What effects will this have on a microarray?

Trinity (Grabherr et al. 2011)
- Uses Illumina PE data
- Incorporates alternative splicing into its assembly
- Does a great job assembling full-length transcripts
- Successfully predicts many isoforms as well

- Downside:
  - Generates potentially incorrect isoforms
    - Different contigs for each haplotype
    - SNP by splicing event
  - Can cluster these results, possibly using CAP3 software for consensus and SNP calling
Grabherr et al. 2011

Calling SNPs
- Many programs do this now
- Each sequencing method has specific errors associated with it
- Best to use SNP calls with > 2 reads for the minor allele to ensure validity
- Generate consensus sequences with SNP calls as the template for probe design
- Know the region of your probes that is sensitive to SNP/indel variation ... Agilent probes are robust!
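The "> 2 reads for the minor allele" rule can be sketched as a simple filter on the bases observed at one alignment position. This is only an illustration of the threshold, not a substitute for a real SNP caller that also weighs base quality and platform-specific errors; the function name and return shape are assumptions.

```python
from collections import Counter


def call_snp(bases, min_minor_reads=3):
    """Return (major, minor, minor_count) if the position passes, else None.

    bases: base calls from all reads covering one position.
    min_minor_reads: 3 corresponds to "more than 2 reads" for the minor allele.
    """
    counts = Counter(b.upper() for b in bases if b.upper() in "ACGT")
    if len(counts) < 2:
        return None  # monomorphic, or no usable base calls
    (major, _), (minor, minor_count) = counts.most_common(2)
    if minor_count >= min_minor_reads:
        return major, minor, minor_count
    return None


print(call_snp("AAAAAGGG"))  # minor allele G seen in 3 reads
print(call_snp("AAAAAAGG"))  # minor allele G seen in only 2 reads
```

The first position passes and the second does not, since two supporting reads are as likely to reflect sequencing error as real variation.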

SNP calling
Wheat, C. W. (2012) SNP Discovery in Non-model Organisms Using 454 Next Generation Sequencing.
Many different methods and criteria. Just because it's published doesn't make it ideal for you.

Choosing Probes
- Binding performance
  - SNPs, indels, alternative splicing
    - Avoid them?
    - Use them, via tiling probes?
- All genes or just annotated ones?
- 3' UTR end or tiling across the whole gene
- Recommend
  - Technical replicates within array
  - Run a test array to assess design
  - Combination of the above?

Potential example
- Only genes / contigs with annotations
- Probes in triplicate
- Tiled across the entire gene
  - Covering SNPs, indels, alt. splicing sites
- Initial array designed, printed, and tested with several different RNA pools to look at probe hybridization performance
- Full experimental set of arrays ordered + 20%
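Tiling probes across a contig and noting which ones sit on known variants might look like the sketch below. The 60-base probe length and 20-base step are assumptions chosen for illustration (check your array platform's specifications), positions are 0-based, and the function name is hypothetical.

```python
def tile_probes(contig, probe_len=60, step=20, variant_positions=()):
    """Tile fixed-length probes across a contig and count variants per probe.

    contig: consensus sequence used as the probe design template.
    variant_positions: 0-based positions of known SNPs/indels on the contig.
    """
    variants = set(variant_positions)
    probes = []
    for start in range(0, len(contig) - probe_len + 1, step):
        end = start + probe_len
        n_variants = sum(1 for pos in variants if start <= pos < end)
        probes.append({
            "start": start,
            "sequence": contig[start:end],
            "n_variants": n_variants,
        })
    return probes


# Toy 100-base contig with one SNP at position 10.
toy_contig = "ACGT" * 25
for probe in tile_probes(toy_contig, variant_positions=[10]):
    print(probe["start"], probe["n_variants"])
```

Probes with `n_variants` of zero are the conservative choice for expression measurement, while the others can be kept deliberately if the aim is to see how variants affect hybridization on the test array.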

Challenge: Bioinformatics
- Annotation of fragmented data
  - Multiple contigs may belong to the same gene
  - Unannotated sequences (novel coding, UTR, junk?)
- How to conduct statistical analysis of the fragmented data?
  - Combine results, pick best probes, etc.?
  - Are outliers biological or technical?
    - If biological, separate loci or splicing?
- Unannotated probes with significant results
  - Where to go?

What will change tomorrow
- Read lengths and quality
  - Read lengths per DNA strand
  - Paired-end fragment sizes
- Parallelization
  - Number of samples per run
- Amount of starting material needed
- Bioinformatic tools
- RNA-Seq more common ...

What won't change tomorrow
- Need for good experimental questions & design
- Biological realities
  - Complications of finding the genes
  - Expression
  - Patterns of genetic variation
- Need for validation (indep. & higher)
- Limited annotation insights

Conclusion
- Many methods, and rationales for using some over others
- You need to decide what you want
  - Arrays work great, but will they take you where you want to go?
- Analysis is the most challenging part, so work with datasets that will be similar to yours.
  - Can you get the answers you want from them?
  - What software/programming skills do you need?
- Collaboration helps for many things

Some references
- Feldmeyer, B., C. W. Wheat, N. Krezdorn, B. Rotter and M. Pfenninger (2011) Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genomics 12:317.
- Wheat, C. W. and H. Vogel (2011) Transcriptome sequencing goals, assembly and assessment. In V. Orgogozo and M. V. Rockman, eds. Molecular methods for evolutionary genetics. Humana Press, New York.
- Wheat, C. W. Rapidly developing functional genomics in ecological model systems via 454 transcriptome sequencing. Genetica 138.
- Hornett, E. A. and C. W. Wheat (2012) Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of evolutionary divergent genomic reference species. BMC Genomics 13:361.
- Grabherr, M. G. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 29:644–652.
- Kumar, S. and M. L. Blaxter (2010) Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 11.
Many are available on my website.

Some recommendations
- Illumina sequencing, paired end, variable fragment size from ..., unnormalized (but normalized is better).
- Many individuals × tissues × treatments, etc., to reflect the experimental material
- Assemble with Trinity, join isoforms and haplotypes into contigs using CAP3
- Assess via BLAST to relevant species
- Annotate dataset
- Design probes for annotated genes, tiling when possible for SNPs, indels, and splicing
- Consider running a test set of probes to assess.

Thanks