7 RNA-Seq overview
Li J, Witten DM, Johnstone IM, Tibshirani R. (2011) Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics Oct 14.
8 McIntyre LM, Lopiano KK, Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV. (2011) RNA-seq: technical variability and sampling. BMC Genomics. 12:293.
9 RNA-Seq overview
Auer PL, Doerge RW. (2010) Statistical design and analysis of RNA sequencing data. Genetics 185:
10 RNA-Seq overview
Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 333
11 RNA-Seq overview
Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 338
12 RNA-Seq overview
Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 338
13 RNA-Seq overview
Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 338
14 Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 350
15 RNA-Seq overview
Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 353
16 RNA-Seq overview
Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 354
17 mRNA gene expression
Biostatistical expertise is essential
Study design and power estimates need to be worked out before sequencing
Li J, Witten DM, Johnstone IM, Tibshirani R. (2011) Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics Oct 14.
McCarthy DJ, Chen Y, Smyth GK. (2012) Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. (Feb. 6)
Fang Z, Cui X. (2011) Design and validation issues in RNA-seq experiments. Brief Bioinform. 12(3):
18 mRNA gene expression
Influence of sequencing depth
Influence of mappability
Mercer TR, Gerhardt DJ, Dinger ME, Crawford J, Trapnell C, Jeddeloh JA, Mattick JS, Rinn JL. (2011) Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol. 30(1):
Roberts A, Pachter L. (2011) RNA-Seq and find: entering the RNA deep field. Genome Med. 3(11):74.
Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, Ribeca P. (2012) Fast computation and applications of genome mappability. PLoS One. 7(1):e
19 mRNA gene expression
Figure 1. To demonstrate the difficulty of accurate isoform-level abundance estimation on low-abundance genes, we simulated an RNA CaptureSeq experiment on the 18 isoforms of the dystrophin gene as annotated in RefSeq (hg19), shown in (a). For each number of 76 bp paired-end fragments that aligned to the gene, we estimated abundances of each isoform using the online EM algorithm (for details of the model used see  and for details of the implementation see ). (b) The accuracy of isoform abundance estimation measured as the Pearson correlation coefficient (r) of the logged relative abundance estimates compared with the true abundance used to generate the simulated data. The results are averaged over four simulations from different random abundance distributions. Because of the similarity of the isoforms, only 2.5% of fragments aligned uniquely to a single isoform on average, making the deconvolution particularly difficult. The bottom x-axis shows how many alignable paired-end fragments would be required to achieve the same r in a genome-wide experiment as in the CaptureSeq simulation. Here we assume 3.17 fragments per kilobase per million mapped reads (FPKM) for the gene, which is what we estimated from a sample ENCODE dataset [accession SRR065495].
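The deconvolution problem described in this caption can be illustrated with a small batch EM sketch. This is not the online EM algorithm the caption refers to: it ignores isoform length and fragment bias, and the function name `em_isoform_abundance` and the toy fragment counts are assumptions made purely for illustration.

```python
def em_isoform_abundance(fragments, n_isoforms, n_iter=200):
    """Estimate relative isoform abundances with a simple batch EM.

    fragments: list of tuples; each tuple holds the indices of the
    isoforms that one fragment is compatible with (hypothetical input).
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # start from uniform abundances
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: split each fragment across its compatible isoforms
        # in proportion to the current abundance estimates.
        for compatible in fragments:
            total = sum(theta[i] for i in compatible)
            for i in compatible:
                expected[i] += theta[i] / total
        # M-step: renormalise the expected counts to abundances.
        n = sum(expected)
        theta = [c / n for c in expected]
    return theta


# Toy data: 30 fragments unique to isoform 0, 10 unique to isoform 1,
# and 60 fragments that align equally well to both isoforms.
toy_fragments = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 60
toy_theta = em_isoform_abundance(toy_fragments, n_isoforms=2)
```

In this toy case only the 40 uniquely aligned fragments carry information about the ratio of the two isoforms, which is why a gene with very few unique alignments, such as the one in the caption, needs many more fragments for a stable estimate.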
22 mRNA gene expression
Guidelines and reviews
Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium.
Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, Gingeras TR, Oliver B. (2011) Synthetic spike-in standards for RNA-seq experiments. Genome Res. 21(9):
Garber M, Grabherr MG, Guttman M, Trapnell C. (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 8(6):
Ramsköld D, Kavak E, Sandberg R. (2012) How to analyze gene expression using RNA-sequencing data. Methods Mol Biol. 802:
23 mRNA transcript isoforms
TopHat and related programs
Trapnell C, Pachter L, Salzberg SL. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):
Langmead B, Hansen KD, Leek JT. (2010) Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 11(8):R83.
Kim D, Salzberg SL. (2011) TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 12(8):R72.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28(5):
Roberts A, Pimentel H, Trapnell C, Pachter L. (2011) Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 27(17):
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12(3):R22.
26 Figure 1 | Strategies for gapped alignments of RNA-seq reads to the genome. (a,b) An illustration of reads obtained from a two-exon transcript; black and gray indicate exonic origin of reads. Exon-first methods (a) map full, unspliced reads (exonic reads), and remaining reads are divided into smaller pieces and mapped to the genome. An extension process extends mapped pieces to find candidate splice sites to support a spliced alignment. Seed-and-extend methods (b) store a map of all small words (k-mers) of similar size in the genome in an efficient lookup data structure; each read is divided into k-mers, which are mapped to the genome via the lookup structure. Mapped k-mers are extended into larger alignments, which may include gaps flanked by splice sites. (c) A potential disadvantage of exon-first approaches illustrated for a gene and its associated retrotransposed pseudogene. Mismatches compared to the gene sequence are indicated in red. Exonic reads will map to both the gene and its pseudogene, preferring gene placement owing to lack of mutations, but a spliced read could be incorrectly assigned to the pseudogene as it appears to be exonic, preventing higher-scoring spliced alignments from being pursued.
Nature Methods
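The seed-and-extend idea in panel (b) can be sketched in a few lines. This is a toy illustration only, not how any particular aligner is implemented: the function names, the 26 bp "genome" and the value k = 4 are all assumptions, and the extension step is left out.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_hits(read, index, k):
    """Split the read into non-overlapping k-mers and look each one up.

    Returns (offset_in_read, position_in_genome) pairs for every seed hit.
    """
    hits = []
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            hits.append((offset, pos))
    return hits


# Toy genome: exon 1 (8 bp), a 10 bp "intron" of G's, exon 2 (8 bp).
toy_genome = "ACGTTGCA" + "GGGGGGGGGG" + "TTCAGGAC"
k = 4
toy_index = build_kmer_index(toy_genome, k)

# A spliced read: the last 4 bases of exon 1 joined to the first 4 of exon 2.
spliced_read = "TGCA" + "TTCA"
toy_hits = seed_hits(spliced_read, toy_index, k)

# The second seed lands 10 bp further along the genome than an unspliced
# read would allow; that gap is the candidate intron.
gap = toy_hits[1][1] - (toy_hits[0][1] + k)
```

A real aligner would then extend each seed base by base and check the gap boundaries for splice-site signals, which this sketch does not attempt.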
27 Figure 2 | Transcriptome reconstruction methods. (a) Reads originating from two different isoforms of the same genes are colored black and gray. In genome-guided assembly, reads are first mapped to a reference genome, and spliced reads are used to build a transcript graph, which is then parsed into gene annotations. In the genome-independent approach, reads are broken into k-mer seeds and arranged into a de Bruijn graph structure. The graph is parsed to identify transcript sequences, which are aligned to the genome to produce gene annotations. (b) Spliced reads give rise to four possible transcripts, but only two transcripts are needed to explain all reads; the two possible sets of minimal isoforms are depicted.
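For the genome-independent approach in this caption, a minimal sketch of building a de Bruijn graph from reads might look like the following. The function name, the two toy reads and k = 3 are assumptions for illustration; real assemblers add error correction, coverage tracking and graph simplification on top of this.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two toy reads that share a prefix and then diverge, as two isoforms
# sharing an upstream exon would.
toy_reads = ["AACCGG", "AACCTT"]
toy_graph = de_bruijn_graph(toy_reads, k=3)

# The node "CC" has two outgoing edges, one per isoform-specific path.
branch = sorted(toy_graph["CC"])
```

The branch at node "CC" is the kind of structure that gets parsed into alternative transcript sequences.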
28 Figure 3 | An overview of gene expression quantification with RNA-seq. (a) Illustration of transcripts of different lengths with different read coverage levels (left) as well as total read counts observed for each transcript (middle) and FPKM-normalized read counts (right). (b) Reads from alternatively spliced genes may be attributable to a single isoform or more than one isoform. Reads are color-coded when their isoform of origin is clear. Black reads indicate reads with uncertain origin. ‘Isoform expression methods’ estimate isoform abundances that best explain the observed read counts under a generative model. Samples near the original maximum likelihood estimate (dashed line) improve the robustness of the estimate and provide a confidence interval around each isoform’s abundance. (c) For a gene with two expressed isoforms, exons are colored according to the isoform of origin. Two simplified gene models used for quantification purposes, spliced transcripts from each model and their associated lengths, are shown to the right. The ‘exon union model’ (top) uses exons from all isoforms. The ‘exon intersection model’ (bottom) uses only exons common to all gene isoforms. (d) Comparison of true versus estimated FPKM values in simulated RNA-seq data. The x = y line in red is included as a reference.
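Panel (a) contrasts raw read counts with FPKM-normalized counts. A minimal sketch of the FPKM calculation (fragments per kilobase of transcript per million mapped fragments) is shown below; the function name and the example numbers are assumptions chosen for illustration.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Two hypothetical transcripts with the same coverage but different lengths:
# the longer one collects four times as many fragments, yet both get the
# same FPKM once length is taken into account.
total = 1_000_000
short_fpkm = fpkm(50, 1_000, total)
long_fpkm = fpkm(200, 4_000, total)
```

The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scalings.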
29 Figure 4 | Overview of RNA-seq differential expression analysis. (a) Expression microarrays rely on fluorescence intensity via a hybridization of a small number of probes to the gene RNA. RNA-seq gene expression is measured as the fraction of aligned reads that can be assigned to the gene. (b) A hypothetical gene with two isoforms undergoing an isoform switch between two conditions is shown. The total number of reads aligning to the gene in the two conditions is similar, but its distribution across isoforms changes. Differential expression using the simplified exon union or exon intersection methods reports no changes between conditions while estimating read counts and expression for the individual isoforms detects both differential expression at the gene and isoform level.
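The isoform switch in panel (b) can be made concrete with a small numerical sketch. The read counts, the pseudocount and the function name are hypothetical, and a real analysis would use a statistical test on replicated, length-aware estimates rather than a raw log ratio.

```python
import math


def log2_fold_change(count_a, count_b, pseudocount=1.0):
    """Log2 ratio of condition B over condition A, with a pseudocount."""
    return math.log2((count_b + pseudocount) / (count_a + pseudocount))


# Hypothetical read counts for one gene with two isoforms.
condition_1 = {"isoform_1": 90, "isoform_2": 10}
condition_2 = {"isoform_1": 10, "isoform_2": 90}

# Pooling reads per gene hides the switch: the totals are identical.
gene_change = log2_fold_change(sum(condition_1.values()),
                               sum(condition_2.values()))

# Per-isoform counts reveal a strong change in opposite directions.
isoform_changes = {
    name: log2_fold_change(condition_1[name], condition_2[name])
    for name in condition_1
}
```

Here the pooled read count shows a log2 fold change of zero while each isoform changes roughly eightfold, which is the situation the caption describes for the pooled gene models.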
30 mRNA special cases
Non-polyadenylated transcripts
Nascent transcripts + co-transcriptional splicing
Circular transcripts
A to I editing
31 We found that a few excised introns accumulate in cells and thus constitute a new class of non-polyadenylated long non-coding RNAs. Finally, we have identified a specific subset of poly(A)- histone mRNAs, including two histone H1 variants, that are expressed in undifferentiated hESCs and are rapidly diminished upon differentiation; further, these same histone genes are induced upon reprogramming of fibroblasts to induced pluripotent stem cells.
Yang L, Duff MO, Graveley BR, Carmichael GG, Chen LL. (2011) Genomewide characterization of non-polyadenylated RNAs. Genome Biol. 12(2):R16.
32 Figure 3 Classification of bimorphic transcripts. (a) Gene ontology analysis of bimorphic transcripts according to their functions; see text for details. (b) Overlapping analysis of the expression of bimorphic transcripts in H9 and HeLa cells. (c) An example of a bimorphic histone mRNA, hist1h2afx. Normalized read densities of hist1h2afx from the UCSC genome browser (upper panels). Note that the two isoforms were distinct in poly(A)+ and poly(A)- samples from H9 and HeLa cells. Bottom panels, semi-quantitative RT-PCR with primers that recognize either the longer poly(A)+ transcript or both transcripts confirmed the observations from the deep sequencing. F, forward primer; 1R and 2R, reverse primers. The vertical arrow depicts the position of U7-mediated 3’ end formation. (d) Validation of identified bimorphic transcripts, ccng1 (left panels), nr6a1 (right upper panels) and gprc5a (right bottom panels). Normalized read densities and qRT-PCRs were analyzed as described above. Note that the signals for these transcripts were similar in both poly(A)+ and poly(A)- samples. See text for details. Error bars were calculated from three biological repeats.
33 Figure 1 A large proportion of RNA-seq reads map to intronic regions. (a) The percentage of reads mapping to intronic, exonic and intergenic regions is shown for each RNA-seq dataset (on the y axis). For introns, only reads mapping to the same strand as the surrounding gene are considered. Reads located inside introns but on the opposite strand are labeled intergenic. Introns of length >100 kb (XL introns) are shown by separate bars. (b) Venn diagrams containing the number of identified genes with high intronic RNA score in human fetal and adult brain (left) and human fetal and adult liver (right).
Ameur A, Zaghlool A, Halvardson J, Wetterbom A, Gyllensten U, Cavelier L, Feuk L. (2011) Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nat Struct Mol Biol. 18(12):
34 mRNA special cases
Figure 2 Nascent transcription and co-transcriptional splicing. (a) Pattern for AUTS2 (top) and C21orf34, a noncoding RNA gene (bottom), viewed in the University of California, Santa Cruz (UCSC) Genome Browser (ref. 47). The RNA-seq signals have been smoothed using window averaging. For both protein coding genes and long noncoding RNA genes, there is an apparent ‘saw-tooth’ pattern with higher RNA-seq signal toward the 5′ end of each intron. (b) Model for co-transcriptional splicing. The total RNA-seq data give rise to a typical saw-tooth pattern across genes that are actively transcribed. The gradient of RNA across the introns can be explained by a large number of nascent transcripts in various stages of completion. The pattern is repeated for each intron because the nascent transcript is spliced very rapidly after the polymerase completes transcribing each intron. The sequence read coverage is comparatively higher for exons, as the RNA-seq is measuring both the pool of nascent transcripts and the pool of mature polyadenylated RNA.
35 Figure 3 A gradient of RNA levels within introns. (a) The relative RNA levels of the first two exons and the surrounding introns of GRID2 were measured in complementary DNA (cDNA) from human fetal frontal cortex total RNA using quantitative real-time PCR. The qrtPCR results (bottom) correlate with the intronic signals in the RNA-seq data (top). A schematic view of the two first exons of GRID2 and the intermediate intron is seen in the middle panel. The arrows indicate the location of primers (P1 to P11) used in the experiment. The qrtPCR values are based on three independent experiments (error bars are s.d.). (b) RNA-seq signal over all introns in the genome of at least 50 kb in length (L and XL introns), measured in average depth of coverage per million mapped reads (average dcpm). Each intron was divided into 100 bins, and an average value was calculated for each of the individual bins. The four human samples show a decrease of RNA signal across the intron, with the brain samples showing much steeper slopes compared to liver, indicating nascent transcription and co-transcriptional splicing. The dotted lines at the bottom show the RNA-seq signals on the opposite strand.
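The binning step in panel (b), where each intron is divided into 100 bins and averaged per bin, can be sketched as follows. The function name and the synthetic coverage vector are assumptions; scaling by the number of mapped reads (the "per million" part of dcpm) and averaging across introns are left out.

```python
def bin_coverage(coverage, n_bins=100):
    """Average a per-base coverage vector into n_bins equal-width bins."""
    n = len(coverage)
    if n < n_bins:
        raise ValueError("intron is shorter than the number of bins")
    means = []
    for b in range(n_bins):
        # Integer arithmetic keeps the bins contiguous and non-overlapping.
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        window = coverage[start:end]
        means.append(sum(window) / len(window))
    return means


# Synthetic 1,000 bp intron whose coverage falls steadily from 5' to 3',
# mimicking the gradient described in the caption.
toy_coverage = list(range(1000, 0, -1))
toy_bins = bin_coverage(toy_coverage)
```

Because every intron ends up with the same number of bins, introns of very different lengths can be averaged position by position to give a genome-wide profile.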
36 mRNA special cases
Figure 1. Models to explain exon scrambling. The canonical linear reference transcript is depicted with exons as colored boxes with four exons 1, 2, 3, and 4. Two simple models of RNA structure that could explain scrambled transcripts are depicted at left and right. At left, model 1 depicts how a scrambled exon 3-exon 2 junction could arise from a tandem duplication of exons 3 and 2, positioning the first copy of exon 3 upstream of exon 2. At the RNA level, this event could arise from post-transcriptional exon rearrangement, or a genomic duplication of exons 2 and 3. Under the model of tandem duplication, when one side of a paired-end read maps to the junction between exon 3 and 2, the other may map to any of exons 1, 2, 3 or 4 with probabilities determined by the library’s insert length distribution and the exon lengths. Our data supports paired-end mapping between a junction and exons 2 or 3, but not exons 1 and 4. We note that in principle, the scrambled exon 3 - exon 2 junction could arise from other splicing events and does not necessarily entail tandem duplication. At right, model 2 depicts how a scrambled exon 3 - exon 2 junction could arise from splicing of exons 2 and 3 into a circular RNA molecule, again positioning exon 3 upstream of exon 2. In this model, when one side of a paired-end read maps to the junction between exon 3 and 2, the other will map to exon 2 or exon 3.
doi: /journal.pone g001
Salzman J, Gawad C, Wang PL, Lacayo N, Brown PO. (2012) Circular RNAs Are the Predominant Transcript Isoform from Hundreds of Human Genes in Diverse Cell Types. PLoS One. 7(2):e
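The paired-end logic that separates the two models can be written down as a small sketch. The function names are hypothetical, exons are identified only by their number in the linear transcript, and the rule that the mate must fall between the acceptor and donor exons is a simplifying generalization of the exon 3 - exon 2 example in the caption (it assumes the circle contains every exon in that range).

```python
def is_scrambled(donor_exon, acceptor_exon):
    """A junction is scrambled when the donor exon does not precede
    the acceptor exon in the canonical linear order (e.g. exon 3 -> exon 2)."""
    return donor_exon >= acceptor_exon


def mate_fits_circle(donor_exon, acceptor_exon, mate_exon):
    """Under the circular model the mate of a junction read should map
    inside the circle, i.e. to an exon between acceptor and donor."""
    return acceptor_exon <= mate_exon <= donor_exon


# The exon 3 - exon 2 junction from the caption, with the mate read
# mapping to each of exons 1 to 4 of the linear transcript in turn.
mate_support = {exon: mate_fits_circle(3, 2, exon) for exon in (1, 2, 3, 4)}
```

Mates in exons 1 or 4 would be compatible with tandem duplication but not with a circle of exons 2 and 3, which is the distinction the caption draws from the data.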
37 mRNA special cases
Figure 2. Expression levels of scrambled exons. Analysis of paired-end RNA-Seq data from random primed libraries reveals evidence that scrambled exons are present at high stoichiometries compared to the canonical linear transcript transcribed from a large number of human genes. This phenomenon persists across cell types and is illustrated by the expression patterns of 3 leukocyte cell types: CD19 (B cells), CD34 (stem cells) and neutrophils. The fraction of each scrambled transcript as a fraction of total gene expression is computed. The bar plot depicts the number of circular isoforms with estimated abundance relative to all transcripts of the gene in the following ranges: between 0–25%, 25–50%, 50–75% and 75+%. Hundreds of isoforms in each cell type are estimated to represent more than half of all transcripts from each gene.
doi: /journal.pone g002
38 Figure 3. RNaseR assay confirms scrambled exons arise from circular RNA. Panel A: Total RNA from HeLa cells was digested with RNaseR at varying enzyme concentrations (0, 3, 10, and 100 units) after the RNA was depleted of ribosomal RNA. Primers capable of amplifying the canonical linear transcript and the predicted circular transcript (by outward facing primers within a single exon predicted in the scramble) were used in a RT-PCR experiment for each of the digestion conditions. Canonical transcripts were consistently degraded by RNaseR, only detectable by PCR at 0 units of RNaseR, whereas predicted circular transcripts consistently resisted the RNaseR challenge, providing strong evidence of circularity. FBXW4 and MAN1A2 respectively show 2 and 4 circular isoforms, both of which were predicted by the sequencing data. The predicted lengths of circular isoforms are respectively a 3-2 junction of CAMSAP1 (predicted to produce a 435 bp circle), a 4-2 and 5-2 junction of FBXW4 (predicted to produce 415 and 510 bp circles), a 4-2, 5-2 and 6-2 junction of MAN1A2 (predicted to produce 471, 553, and 648 bp circles), a 3-3 junction in REXO4 (predicted to produce a 338 bp circle), a 2-2 junction of RNF220 (predicted to produce a 742 bp circle) and a 3-2 junction of ZKSCAN1 (predicted to produce a 667 bp circle). Panel B: A northern blot on total and cytoplasmic lysate from HeLa cells shows hybridization of a 481 bp probe complementary to the MAN1A2 5-2 exon scramble. 3.7 and 6.2 ug of total and cytoplasmic RNA were loaded onto a 1% agarose gel and 10 pM of probe was hybridized for 24–48 hours. Detection was performed using the BrightStar BioDetect Kit (Ambion, Austin, TX). The specific band at 553 bp corresponds to the predicted size of a circular RNA containing exons 2, 3, 4 and 5 of MAN1A2.
doi: /journal.pone g003
39 mRNA special cases
Figure 6. Models for generation of circular RNA. At left: a schematic diagram of the canonical splicing process splicing out the first intron of a pre-mRNA of a 4 exon gene, and subsequent removal of introns 2 and 3. Canonical splicing of exon 1 to exon 2 occurs when the splicing machinery catalyzes the formation of the intron lariat and the attack of the free 3’ OH of exon 1 on the 3’ splice site upstream of exon 2. This produces a lariat containing intron 1 and a pre-mRNA with exons 1 and 2 spliced together. At right: a model for the production of circular transcripts. If there is a canonical transcriptional start, and if intron excision does not proceed sequentially in time from the 5’ to 3’ direction of the pre-mRNA, non-canonical pairing of 3’ and 5’ splice sites could be generated. Since the sequences of each 5’ splice site of the pre-mRNA contain the same splicing signals, it is possible that the 3’ splice site upstream of exon 2 is paired with the 5’ splice site downstream of exon 3 and splicing proceeds as if this 5’ splice site were paired with the 3’ splice site upstream of exon 4. In this case, exon 3 would be spliced upstream of exon 2, creating a pre-mRNA intermediate comprised of these two exons and intron 2. Canonical splicing would be predicted to excise this intron, leaving a circular RNA composed of exons 2 and 3. Non-canonical transcription start, as suggested in , could produce an orphan 3’ splice site corresponding to the first transcribed exon. This splice site could be paired with a downstream 5’ splice site, generating a circular RNA. In both models, the excised intron would be linear and branched, and expected to be quickly degraded.
doi: /journal.pone g006
40 mRNA special cases
Bahn JH, Lee JH, Li G, Greer C, Peng G, Xiao X. (2011) Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome Res. 22(1):
43 miRNA gene expression
“Small RNA-Seq” is also very important
For a recent review see:
Preethi H. Gunaratne, Cristian Coarfa, Benjamin Soibam and Arpit Tandon (2012) miRNA Data Analysis: Next-Gen Sequencing. In: Next-generation MicroRNA expression profiling technology, Fan, J.B. (Ed.) Methods in Molecular Biology, Vol. 822, , DOI: / _19
44 Review
NGS instruments and data analysis software
RNA-Seq overview
mRNA gene expression
mRNA transcript isoforms
mRNA special cases
miRNA gene expression