2 Overview of RNA-Seq Activities RNA-Seq Concepts, Terminology, and Work FlowsUsing Single-End Reads and a Reference GenomeGene Construction with Paired-End Reads (and a genome)Alignment to a Reference TranscriptomeRNA-Seq Statistics: Blythe Durbin-JohnsonNow that you’re adept at using Galaxy, you’ll be doing the exercises “on your own”. Don’t worry if you can’t finish them all today.
3 RNA Transcription and Processing A cell contains many types of RNA (rRNA, tRNA, mRNA, miRNA, lncRNA, snoRNA, etc.) – Only ~2% is mRNAFocus: RNA vs. DNA; different types of RNA in cell; mRNA (polyA RNA) transcription/splicing -> translation -> degradationKoning, Plant Physiology Information Website
4 Gene Structure and Alternative Splicing Pearson Education Inc., 2008Focus: One gene in genomic DNA can be transcribed into many different mRNA (splicing): mRNA sequence vs. DNA sequenceMolecular Biology of the Cell, 4th ed.
5 Some mRNA-Seq Applications Differential gene expression analysisTranscriptional profilingIdentification of splice variantsNovel gene identificationTranscriptome assemblySNP findingRNA editingAssumption:Changes in transcription/mRNA levels correlate with phenotype (protein expression)Focus: RNA-Seq can be used in many experiments to answer biological questions.
6 Experimental Design What biological question am I trying to answer? What types of samples (tissue, timepoints, etc.)?How much sequence do I need?Length of read?Platform?Single-end or paired-end?Barcoding?Pooling?Biological replicates: how many?Technical replicates: how many?Protocol considerations?Focus: Considerations before designing your experiment
7 What Is the Goal of the Experiment? Many biological questions, such as…“Characterize the differences between the wild-type and mutant”are broad and open-ended.Such RNA-Seq experiments can be used to generate hypotheses, help form a more-focused question for the next experiment.Make sure your experimental approach is suitable for the question you’re asking. (You will not find mutations in non-transcribed regions with RNA-Seq.)Focus: Considerations before designing your experiment
8 Influence of the Organism Novel – little/no previous sequencingNon-Model – some sequence available (ESTs, Unigene set)Genome-Sequenced – draft genomeThousands of scaffolds, maybe tens of chromosomesSome annotation (ab initio, EST-based, etc.)Model – genome fully sequenced and annotatedMultiple genomes available for comparisonWell-annotated transcriptome based on experimental evidenceGenetic maps with markers availableBasic research can be conducted to verify annotations (mutants available)Focus: Consider how much annotated sequence is available for your organism, and the heterozygosity/ploidy.
9 Amount of Sequence ENCODE Guidelines (v.1 2011, never updated?) Depth determined by the goals of the experiment and type of sample20-25M mappable PE reads (~30M raw PE reads) of length >30NT for polyA samplesM PE76+ reads needed to reliably detect low copy transcripts/isoforms from a typical mammalian tissueVarious studies (often species and tissue/cell line specific):10-20 million reads sufficient for differential gene expressionBut, additional unique transcripts still being found at 1 billion readsFocus: For RNA-Seq, need millions of reads (not necessarily average coverage, like with genome)Figures using hg19 data: Expressions calculated from randomly drawn subsets of reads compared with all 45 million reads. Left: the fraction of genes that are within specified fold-change intervals from the final expression level at different sequence depths, for all genes expressed over one RPKM. Right: The fraction of genes at different sequence depths that are within 20% of the final expression value that was estimated using all 45 million mappable reads. Genes have been grouped according to final RPKM expression level. The different sequence depths were obtained by selecting random subsets of mapped reads and the results are presented as mean and 95% confidence intervals.Bacterial RNA-Seq:one paper did 15M reads (GAI);Second paper, also ran on GAI, ran multiple lanes, getting 3.1 – 7.4 M reads per lane.Third paper, generated million reads per sample on GAII (this paper did comparisons, so is probably best to use?) (He et al., 2010, Nat. Methods 7(10):807)
10 Amount of Sequence Differential gene expression, reads/sample Eukaryotes: 30+ million recommendedBacteria: 10+ million recommendedMore sequence is needed to detect rare transcriptsMeasures of Robustness of Expression Levels vs. Sequencing Depth (hg19)Focus: For RNA-Seq, need millions of reads (not necessarily average coverage, like with genome)Figures using hg19 data: Expressions calculated from randomly drawn subsets of reads compared with all 45 million reads. Left: the fraction of genes that are within specified fold-change intervals from the final expression level at different sequence depths, for all genes expressed over one RPKM. Right: The fraction of genes at different sequence depths that are within 20% of the final expression value that was estimated using all 45 million mappable reads. Genes have been grouped according to final RPKM expression level. The different sequence depths were obtained by selecting random subsets of mapped reads and the results are presented as mean and 95% confidence intervals.Bacterial RNA-Seq:one paper did 15M reads (GAI);Second paper, also ran on GAI, ran multiple lanes, getting 3.1 – 7.4 M reads per lane.Third paper, generated million reads per sample on GAII (this paper did comparisons, so is probably best to use?) (He et al., 2010, Nat. Methods 7(10):807)Ramskold, et al., 2012
11 Platform and Read Length Options Applications40+ SEIllumina(SOLiD)Gene expression quantitationSNP-finding40+ PEIon ProtonBetter specificity for the aboveSplice variant identification100+ PEAll the above and:Differentiation within gene familes/paralogsTranscriptome assembly15kb+20kb+?Ion TorrentSanger(454)PacBio(OxfordNanopore)Resolve haplotypes (phasing)Not recommended for gene expression quantitationFocus: minimum read length is dictated by the application/biological question. Explain that long reads are from lower throughput platforms.Notes: added the PacBio to same box as 454/Ion Torrent/Sanger since the Applications are all the same.Per Ion Torrent website (iontorrent.com, co31580 workflow.pdf) readlength is currently 100 bp but will increase to 200 bp in 4Q 2011 when new 318 chip is released.Per Roche/454 website (GSFLXApplicationFlyer_FINALv2.pdf, 6/11), Drosophila cDNA read length is currently ~ basesPer SOLiD website ( max read lengths are 75 SE, 60 x 60 mate paired, 75 x 35 PE
12 MultiplexingcDNA InsertIndexAdapterShort (6-8 nt), unique barcodes (index) introduced as part of adaptersProvide unique identifier for each sampleBarcodes should be tolerant of 1-2 sequencing errorsBarcodes allow deconvolution of samplesAllows pooling samples to mitigate lane effectsAllows sequencing capacity to be used efficientlyDual barcodes allow deep multiplexing (e.g., 96 samples)Focus: multiplexing is advantageous. (More details to come in Friday’s talks)See datasheet_sequencing_multiplex.pdf: the diagram for the Illumina barcodes may be a little complicated for this talk, also only pertains to Illumina barcodes.
13 Biological Replicates Allow measurement of variation between individuals/samplesMore replicates increase statistical powerSome studies support increasing replicates over deeper sequencing (as long as you can detect genes of interest)Genetic Variation/Hetrozygosity:Is each individual a different genotype?Are individuals highly inbred or clonal?Haploid or diploid or polyploid?Pooling with barcodes – each sample is a replicatePooling without barcodes – each pool is a replicateValidation on individual samplesFocus: Advantages of biological replicates (more details in Vince’s stats talk)
14 Technical Replicates Account for variation in preparation Cost can be prohibitiveBetter to do more biological replicatesBarcoding/pooling samples across multiple lanesRecommended to even out lane effectsAllow data processing even if one lane failsFocus: Technical replicates are generally not cost effective. There’s very little variation between lanes.
15 Illumina HiSeq Flow Cell Lanes ExampleThis experimental design has biological replicates and is multiplexed to mitigate lane effectsEach sample will generate, on average, ~50 million reads.Control: 6 biological replicates123456Treated: 6 biological replicates123456Focus: Example of experiment with multiple replicates pooled and run over multiple lanesEach sample is individually barcoded; all samples are pooled and run in two lanes of HiSeq3000Illumina HiSeq Flow Cell Lanes
16 mRNA-Seq Protocol Overview Sample RNAsTOTAL RNAmRNAFRAGMENTED mRNA/cDNAFocus: basic protocol for converting total RNA to cDNA molecules for sequencing: polyA selection, fragmentation, adding adapters.FINISHED LIBRARYAdapted from Simon et al., 2009, Ann. Rev. Plant Biol. 60:305
17 Wetlab Considerations The greatest source of variability in RNA-Seq experiments is “the human factor”.All RNA isolations need to be done by the same person with the same kit (preferably the same lot).Practice on non-precious samples. Then, if you can, process extra samples, and use those with the most consistent quality for library prep.All library preps need to be done by the same person with the same kit (preferably the same lot).If you must process in batches, each batch should consist of a randomized subset of samplesIf you are processing many samples (>~20) consider robotics.Use a logical, alphanumeric sample ID system. (Don’t use symbols, like “+” and “-”.)Focus: polyA mRNA isolation is used most often and is very effective.
18 RNA Processing PolyA Selection rRNA Depletion Oligo-dT, often using magnetic beadsIsolates mRNA very efficiently unless total RNA is very diluteCan’t be used to sequence non-polyA RNArRNA DepletionRiboZero, RiboMinusNon-polyA RNAs preserved (non-coding, bacterial RNA, etc.)Can be less effective at removing all rRNATotal RNA samplerRNAmRNAAfter PolyA isolationAfter rRNA depletionFocus: polyA mRNA isolation is used most often and is very effective.
19 Strand-Specific (Directional) RNA-Seq Now the default for Illumina TruSeq kitsPreserves orientation of RNA after reverse transcription to cDNAInforms alignments to genomeDetermine which genomic DNA strand is transcribedIdentify anti-sense transcription (e.g., lncRNAs)Quantify expression levels more preciselyDemarcate coding sequences in microbes with overlapping genesVery useful in transcriptome assembliesAllows precise construction of sense and anti-sense transcripts--3’5’--
20 Insert SizeAdapter~60 basescDNAbases=Insert SizeThe “insert” is the cDNA (or RNA) ligated between the adapters. Typical insert size is bases, but can be larger. Insert size distribution depends on library prep method.Overlapping paired-end readsFocus: The cDNA insert is between the adapters. Showing SE and PE reads in relation to the insert.Gapped paired-end readsSingle-end read20406080100120140160180200
21 Paired-End Reads Read 1 Read 1 Read 2 INSERT SIZE (TLEN) 20406080100120140160180200Read 1 Read 2INSERT SIZE (TLEN)INNER DISTANCE (+ or -)Focus: The cDNA insert is between the adapters. Showing SE and PE reads in relation to the insert.20406080100120140160180200Read 1 Read 2
22 Differential Gene Expression Generalized Workflow Focus: introduction of the workflow – details on next slides
23 Adapter Contamination cDNA InsertIndexAdapterRead 1 (forward)Read 2 (reverse)cDNA inserts are a distribution of sizesThere will be some read-through with adapter sequence at 3’ endRemoval of adapter contamination can improve fraction of reads that align to the referenceVery important for de novo assembliesFocus: multiplexing is advantageous. (More details to come in Friday’s talks)See datasheet_sequencing_multiplex.pdf: the diagram for the Illumina barcodes may be a little complicated for this talk, also only pertains to Illumina barcodes.
24 Alignment - Choosing a Reference Fully sequenced and annotated genomeProvides exon information to find splice variantsPredicted/validated transcriptomeSimple to useComprehensive for all but the most novel genesNCBI Unigene SetsOften incompleteGood for medium to highly expressed genesNo Genome? No Problem!Transcriptome assemblyUseful for organisms with little or no sequence availableBut, expect some redundancy and collapsing of gene familiesFocus: Choice of reference depends on experimental question, if your organism is annotated
25 Read Alignment and Counting Align reads to genome or transcriptome (output sam/bam)Convert alignments to read-counts per geneMay need to parse genomic intervals from gene modelsOutput is table of raw counts per gene for each sampleSimple NormalizationRPKM (Reads Per Kilobase per Million reads mapped)FPKM (Fragments Per Kilobase per Million reads mapped)Fragment = cDNA insertIdeally, there are two mappable PE reads per fragmentCPM (counts per million)TMM (Trimmed Mean of M-Values) – used in edgeR, presumes most genes are not differentially expressedStatistical Analysis (Blythe’s talk)Compare expression between samples, tissues, etc.Use appropriate statistical model for your experiment.Focus: re-iterate main points, including file formats, link to stats talk to follow.
26 Read Alignment and Counting Alignment to Genome – one splice variant(Exon A - 21)(Exon B - 12)(Exon C - 22)Alignment to TranscriptomeFocus: Reads can be aligned to genome or transcriptome. Depending on aligner used, some reads across intron-exon boundaries may not be counted, but that will be consistent across genome.(66)
27 Read Alignment and Counting Alignment to Genome – two splice variants(Exon A - 21)(Exon B - 6)(Exon C - 22)Alignment to Transcriptome (Gene Sequences)Focus: If there are two splice variants, relative expression levels can be calculated by exon, but generally won’t be found in transcriptome mapping (Vince will cover multi-mappers)(57)
28 Splicing-Aware Alignment A splicing-aware aligner will recognize the difference between a short insert and a read that aligns across exon-intron boundariesSpliced Reads(joined by dashed line)Read Pairs(joined by solid line)
29 Transcript Reference vs. Genome Reference Some reads will align uniquely to an exon in the genome. But how can transcript abundance be determined?
30 Multiple Mapping within the Genome Some reads will align to more than one location in the genome. Which gene/transcript should this read be assigned to?
31 Multiple-Mapping Reads (“Multireads”) Some reads will align to more than one place in the reference, due to:Common domains, gene familiesParalogs, pseudogenes, etc.Shared exons (if reference is transcriptome)This can distort counts, leading to misleading expression levelsIf a read can’t be uniquely mapped, how should it be counted?Should it be ignored (not counted at all)?Should it be randomly assigned to one location among all the locations to which it aligns equally well?This may depend on the question you’re asking……and also depends on the software you use.
32 Choosing an Aligner Transcriptome reference – BWA, Bowtie2 Splicing-aware not neededGenome referenceAligner must be splicing-aware to account for reads that cross intron-exon boundariesTopHat2 (Bowtie2) (GSNAP (research-pub.gene.com/gmap/)STAR ( – fast, but uses most memoryBbmap ( – fast, part of tool suiteHISAT ( – newest, fast with low memory use.Each aligner has multiple parameters that can be tweaked, affecting read mapping resultsMost software is updated regularly, to improve performance and accommodate new technologiesGET ON THE MAILING LISTS/GOOGLE GROUPS!
33 How Well Did Your Reads Align to the Reference? Calculate percentage of reads mapped per sample% Reads MappedIncomplete Reference?GoodGreat!Sample Contamination?Focus: It’s important and informative to calculate % reads mapped per sample
34 RNA Quality Assessment rRNA contaminationAlign reads to rRNA sequences from organism or relativesGenerally, don’t need to remove rRNA readsTotal RNA samplerRNAmRNAAfter PolyA isolationAfter rRNA depletionFocus: Need to do standard QA&I plus rRNA contamination check; but sample protocol has influence
35 Checking Your Results Key genes that may confirm sample ID Knock-out or knock-down genesGenes identified in previous researchSpecific genes of interestHypothesis testingImportant pathwaysExperimental validation (e.g., qRT-PCR)Generally required for publicationThe best way to determine if your analysis protocol accurately models your organism/experimentIdeally, validation should be conducted on a different set of samplesFocus: Ways to check results to see if they make sense based on the expectations of your experiment
36 This paper compares eleven methods Analysis ChoicesThis paper compares eleven methodsThis paper compares seven methods
37 Analysis ChoicesEvaluated 6 differential gene expression analysis software packages (did not investigate differential isoform expression)Increasing replicates is more important than increasing sequencing depthTranscript length bias reduces the ability to find differential expression in shorter genes.limma and baySeq most closely model “reality”.limma and edgeR had the fewest number of false positives.BUT, 5 of 6 packages were out-of-date by publication date; at least two changed substantially, so this analysis might be different today
38 Where to find some guidance? The RNA Sequencing Quality Control (SEQC) Consortium and the Association of Bimolecular Resource Facilities (ABRF) conducted several systematic large-scale assessmentsRNA-Seq replicates run in ABRF study:15 lab sites4 protocols (polyA-select, ribo-depleted, size selected, degraded)5 platforms (HiSeq, Ion Torrent PGM & Proton, PacBio, 454)SEQC generated over 100 billion reads across three platformsMore than 10Tb data available for analysis
39 Differential Gene Expression Generalized Workflow Bioinformatics analyses are in silico experimentsThe tools and parameters you choose will be influenced by factors including:Available reference/annotationExperimental design (e.g., pairwise vs. multi-factor)The “right” tools are the ones that best inform on your experimentDon’t just shop for methods that give you the answer you wantFocus: introduction of the workflow – details on next slides
40 Differential Gene Expression Generalized Workflow File TypesFastqFasta (reference)gtf (gene features)Sam/bamFocus: introduction of the workflow – details on next slidesText tables
41 The GTF (Gene Transfer Format) File The left columns list source, feature type, and genomic coordinates The right column includes attributes, including gene ID, etc.
42 Fields in the GTF FileSequence Name (i.e., chromosome, scaffold, etc.) chr12Source (program that generated the gtf file or feature) unknownFeature (i.e., gene, exon, CDS, start codon, stop codon) CDSStart (starting location on sequence)End (end position on sequence)ScoreStrand (+ or -)Frame (0, 1, or 2: which is first base in codon, zero-based) 2Attribute (“;”-delimited list of tags with additional info)This attribute provides info to Tophat/Cufflinksgene_id "PRMT8"; gene_name "PRMT8"; p_id "P10933";transcript_id "NM_019854"; tss_id "TSS4368";
44 Differential Gene Expression Generalized Workflow Softwarescythesickletophat2/bowtie2Focus: introduction of the workflow – details on next slides(cufflinks2) cuffdiffOrR packages (edgeR/limma)
45 Today’s Exercises – Differential Gene Expression Today, we’ll analyze a few types of data you’ll typically encounter:Single-end reads (“tag counting” with well-annotated genomes) using RNA-Seq data from the same experiment as yesterday’s ChIP-Seq data.Paired-end reads (finding novel transcripts in a genome with incomplete annotation)Paired-end reads for differential gene expression when only a transcriptome is available (such as after a transcriptome assembly)
46 Today’s Exercises – Differential Gene Expression And we’ll be using a few different software:Tophat to align spliced reads to a genomeCuffdiff for differential expression of transcripts/genes from tophat alignmentscummeRbund to generate diagnostic plots from cuffdiff outputhtseq-count to generate raw counts tables for…edgeR, which can also handle more complex experimental designsCufflinks to find novel genes and transcriptsBowtie2 to align reads to a transcriptome referencesam2counts to extract raw counts from bowtie2 alignments
47 Today’s Exercises – Differential Gene Expression Later, we’ll talk about other RNA-Seq topics, includingThe ever-increasing software packages in the Tuxedo SuiteFinding novel transcripts (“Gene Construction”)And, what happens after the analysis? (Is there an “after”?)But for now, let’s get some analysis going!