Analysis of Transcription using RNA-Seq


1 Analysis of Transcription using RNA-Seq
Introduction to Next-Generation Sequencing (NGS)
Dr. Robert Boissy
SWH2048

2 Outline
NGS instruments and data analysis software
RNA-Seq overview
mRNA gene expression
mRNA transcript isoforms
mRNA special cases
miRNA gene expression

3 NGS Instruments and software
Glenn TC. (2011) Field guide to next-generation DNA sequencers. Mol Ecol Resour. 11(5):

4 NGS Instruments and software

5 NGS Instruments and software

6 NGS Instruments and software

7 RNA-Seq overview Li J, Witten DM, Johnstone IM, Tibshirani R. (2011) Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics Oct 14.

8 McIntyre LM, Lopiano KK, Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV. (2011) RNA-seq: technical variability and sampling. BMC Genomics. 12:293.

9 RNA-Seq overview Auer PL, Doerge RW. (2010) Statistical design and analysis of RNA sequencing data. Genetics 185:

10 RNA-Seq overview Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 333

11 RNA-Seq overview Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 338

12 RNA-Seq overview Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 338

13 RNA-Seq overview Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 338

14 Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 350

15 RNA-Seq overview Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 353

16 RNA-Seq overview Pevsner, J (2009) Bioinformatics and Functional Genomics Wiley-Blackwell p. 354

17 mRNA gene expression
Biostatistical expertise is essential
Study design and power estimates need to be worked out before sequencing
Li J, Witten DM, Johnstone IM, Tibshirani R. (2011) Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics Oct 14.
McCarthy DJ, Chen Y, Smyth GK. (2012) Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. (Feb. 6)
Fang Z, Cui X. (2011) Design and validation issues in RNA-seq experiments. Brief Bioinform. 12(3):

18 mRNA gene expression
Influence of sequencing depth
Influence of mappability
Mercer TR, Gerhardt DJ, Dinger ME, Crawford J, Trapnell C, Jeddeloh JA, Mattick JS, Rinn JL. (2011) Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol. 30(1):
Roberts A, Pachter L. (2011) RNA-Seq and find: entering the RNA deep field. Genome Med. 3(11):74.
Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, Ribeca P. (2012) Fast computation and applications of genome mappability. PLoS One. 7(1):e

19 mRNA gene expression Figure 1. To demonstrate the difficulty of accurate isoform-level abundance estimation on low-abundance genes, we simulated an RNA CaptureSeq experiment on the 18 isoforms of the dystrophin gene as annotated in RefSeq (hg19), shown in (a). For each number of 76 bp paired-end fragments that aligned to the gene, we estimated abundances of each isoform using the online EM algorithm (for details of the model used see [8] and for details of the implementation see [11]). (b) The accuracy of isoform abundance estimation measured as the Pearson correlation coefficient (r) of the logged relative abundance estimates compared with the true abundance used to generate the simulated data. The results are averaged over four simulations from different random abundance distributions. Because of the similarity of the isoforms, only 2.5% of fragments aligned uniquely to a single isoform on average, making the deconvolution particularly difficult. The bottom x-axis shows how many alignable paired-end fragments would be required to achieve the same r in a genome-wide experiment as in the CaptureSeq simulation. Here we assume 3.17 fragments per kilobase per million mapped reads (FPKM) for the gene, which is what we estimated from a sample ENCODE dataset [accession SRR065495].
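
The isoform deconvolution described in this caption can be illustrated with a toy expectation-maximization loop. This is a minimal sketch, not the online EM implementation the caption refers to: it is a plain batch EM, it ignores isoform length and fragment-length effects, and the function name `em_isoform_abundance` and the read counts are hypothetical.

```python
def em_isoform_abundance(compat, n_iso, n_iter=200):
    """Estimate relative isoform abundances with batch EM.

    compat[i] is the non-empty set of isoform indices that read i is
    compatible with. Isoform length is ignored (an assumption of this toy).
    """
    theta = [1.0 / n_iso] * n_iso  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_iso
        # E-step: split each read among its compatible isoforms
        # in proportion to the current abundance estimates.
        for iso_set in compat:
            total = sum(theta[j] for j in iso_set)
            for j in iso_set:
                counts[j] += theta[j] / total
        # M-step: new abundances are the normalized expected counts.
        n_reads = sum(counts)
        theta = [c / n_reads for c in counts]
    return theta


# Hypothetical data: 30 reads unique to isoform 0, 10 unique to isoform 1,
# and 60 reads that align equally well to both isoforms.
reads = [{0}] * 30 + [{1}] * 10 + [{0, 1}] * 60
theta = em_isoform_abundance(reads, n_iso=2)
print([round(t, 3) for t in theta])
```

In this toy the uniquely aligned reads determine how the ambiguous reads are shared out. When very few fragments align uniquely, as in the dystrophin simulation, the estimate rests on much less information, which is why many more fragments are needed.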

20

21 mRNA gene expression

22 mRNA gene expression
Guidelines and reviews
Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium.
Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, Gingeras TR, Oliver B. (2011) Synthetic spike-in standards for RNA-seq experiments. Genome Res. 21(9):
Garber M, Grabherr MG, Guttman M, Trapnell C. (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 8(6):
Ramsköld D, Kavak E, Sandberg R. (2012) How to analyze gene expression using RNA-sequencing data. Methods Mol Biol. 802:

23 mRNA transcript isoforms
TopHat and related programs
Trapnell C, Pachter L, Salzberg SL. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):
Langmead B, Hansen KD, Leek JT. (2010) Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 11(8):R83.
Kim D, Salzberg SL. (2011) TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 12(8):R72.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28(5):
Roberts A, Pimentel H, Trapnell C, Pachter L. (2011) Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 27(17):
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12(3):R22.

24 mRNA transcript isoforms

25 mRNA transcript isoforms

26 Figure 1 | Strategies for gapped alignments of RNA-seq reads to the genome. (a,b) An illustration of reads obtained from a two-exon transcript; black and gray indicate exonic origin of reads. Exon-first methods (a) map full, unspliced reads (exonic reads), and remaining reads are divided into smaller pieces and mapped to the genome. An extension process extends mapped pieces to find candidate splice sites to support a spliced alignment. Seed-and-extend methods (b) store a map of all small words (k-mers) of similar size in the genome in an efficient lookup data structure; each read is divided into k-mers, which are mapped to the genome via the lookup structure. Mapped k-mers are extended into larger alignments, which may include gaps flanked by splice sites. (c) A potential disadvantage of exon-first approaches illustrated for a gene and its associated retrotransposed pseudogene. Mismatches compared to the gene sequence are indicated in red. Exonic reads will map to both the gene and its pseudogene, preferring gene placement owing to lack of mutations, but a spliced read could be incorrectly assigned to the pseudogene as it appears to be exonic, preventing higher-scoring spliced alignments from being pursued.
Source: Nature Methods.
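
The seed step of a seed-and-extend aligner can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any particular aligner: the sequences are invented, the function names `build_kmer_index` and `seed_hits` are hypothetical, only exact k-mer matches are used, and the extension step is left out.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_hits(read, index, k):
    """Return (read_offset, genome_position) for each non-overlapping seed."""
    hits = []
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            hits.append((offset, pos))
    return hits


# Toy two-exon gene; the intron begins with GT and ends with AG.
exon1, intron, exon2 = "CATGGCTAAC", "GTAAGTCCCCCCTTTCAG", "ATCGGACTTG"
genome = exon1 + intron + exon2
read = exon1[-8:] + exon2[:8]  # a read spanning the exon-exon junction

index = build_kmer_index(genome, k=4)
hits = seed_hits(read, index, k=4)

# Seeds from the same exon share a diagonal (genome position - read offset).
# A jump between diagonals marks a candidate gap, here the intron.
diagonals = sorted({pos - offset for offset, pos in hits})
print(hits, diagonals)
```

The two diagonals differ by the intron length, which is the signal an extension step would use to place a gap flanked by splice sites.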

27 Figure 2 | Transcriptome reconstruction methods. (a) Reads originating from two different isoforms of the same gene are colored black and gray. In genome-guided assembly, reads are first mapped to a reference genome, and spliced reads are used to build a transcript graph, which is then parsed into gene annotations. In the genome-independent approach, reads are broken into k-mer seeds and arranged into a de Bruijn graph structure. The graph is parsed to identify transcript sequences, which are aligned to the genome to produce gene annotations. (b) Spliced reads give rise to four possible transcripts, but only two transcripts are needed to explain all reads; the two possible sets of minimal isoforms are depicted.
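
The genome-independent approach in this caption starts from a de Bruijn graph of k-mers. The following is a minimal sketch with invented reads and a hypothetical function name `de_bruijn`; real assemblers also handle sequencing errors, coverage and reverse complements, none of which is modeled here.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two hypothetical reads that share a prefix and then diverge,
# as reads from two isoforms of one gene might.
reads = ["ACGTTG", "ACGTCA"]
graph = de_bruijn(reads, k=4)

# Nodes with more than one successor are branch points in the graph.
branches = {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}
print(branches)
```

Each branch point corresponds to a position where the reads support more than one continuation. Parsing the graph into transcripts amounts to choosing paths through such branches.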

28 Figure 3 | An overview of gene expression quantification with RNA-seq.
(a) Illustration of transcripts of different lengths with different read coverage levels (left) as well as total read counts observed for each transcript (middle) and FPKM-normalized read counts (right). (b) Reads from alternatively spliced genes may be attributable to a single isoform or more than one isoform. Reads are color-coded when their isoform of origin is clear. Black reads indicate reads with uncertain origin. ‘Isoform expression methods’ estimate isoform abundances that best explain the observed read counts under a generative model. Samples near the original maximum likelihood estimate (dashed line) improve the robustness of the estimate and provide a confidence interval around each isoform’s abundance. (c) For a gene with two expressed isoforms, exons are colored according to the isoform of origin. Two simplified gene models used for quantification purposes, spliced transcripts from each model and their associated lengths, are shown to the right. The ‘exon union model’ (top) uses exons from all isoforms. The ‘exon intersection model’ (bottom) uses only exons common to all gene isoforms. (d) Comparison of true versus estimated FPKM values in simulated RNA-seq data. The x = y line in red is included as a reference.
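
The FPKM normalization mentioned in panel (a) divides the fragment count by transcript length in kilobases and by the number of mapped fragments in millions. A small worked sketch follows; the counts, lengths and library size are made up, and the function name `fpkm` is hypothetical.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


total_mapped = 1_000_000  # assumed library size

# A transcript twice as long collects twice as many fragments
# at the same abundance, so the raw counts differ ...
short_tx = fpkm(fragments=100, length_bp=1000, total_mapped=total_mapped)
long_tx = fpkm(fragments=200, length_bp=2000, total_mapped=total_mapped)

# ... but the length-normalized values agree.
print(short_tx, long_tx)
```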

29 Figure 4 | Overview of RNA-seq differential expression analysis.
(a) Expression microarrays rely on fluorescence intensity via hybridization of a small number of probes to the gene RNA. RNA-seq gene expression is measured as the fraction of aligned reads that can be assigned to the gene. (b) A hypothetical gene with two isoforms undergoing an isoform switch between two conditions is shown. The total number of reads aligning to the gene in the two conditions is similar, but its distribution across isoforms changes. Differential expression using the simplified exon union or exon intersection methods reports no changes between conditions, while estimating read counts and expression for the individual isoforms detects differential expression at both the gene and isoform level.
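
The isoform switch in panel (b) can be made concrete with invented numbers. In this sketch all counts, lengths and the library size are hypothetical, and the long isoform is assumed to contain every exon of the short one, so the exon union length equals the long isoform's length.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped)


TOTAL = 1_000_000                               # assumed mapped fragments per sample
LENGTHS = {"short": 1000, "long": 4000}         # isoform lengths in bp
UNION_LENGTH = 4000                             # exon union model length (assumed)

# Fragment counts per isoform in two conditions: same total, swapped split.
conditions = {
    "A": {"short": 400, "long": 100},
    "B": {"short": 100, "long": 400},
}

union_fpkm = {}
isoform_sum_fpkm = {}
for name, counts in conditions.items():
    # Exon union model: pool all fragments over one fixed gene length.
    union_fpkm[name] = fpkm(sum(counts.values()), UNION_LENGTH, TOTAL)
    # Isoform-level model: normalize each isoform by its own length, then sum.
    isoform_sum_fpkm[name] = sum(
        fpkm(n, LENGTHS[iso], TOTAL) for iso, n in counts.items()
    )

print(union_fpkm, isoform_sum_fpkm)
```

The exon union value is identical in both conditions because the total count is unchanged, while the isoform-level gene estimate differs because the short isoform contributes more expression per fragment.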

30 mRNA special cases
Non-polyadenylated transcripts
Nascent transcripts + co-transcriptional splicing
Circular transcripts
A to I editing

31 We found that a few excised introns accumulate in cells and thus constitute a new class of non-polyadenylated long non-coding RNAs. Finally, we have identified a specific subset of poly(A)- histone mRNAs, including two histone H1 variants, that are expressed in undifferentiated hESCs and are rapidly diminished upon differentiation; further, these same histone genes are induced upon reprogramming of fibroblasts to induced pluripotent stem cells. Yang L, Duff MO, Graveley BR, Carmichael GG, Chen LL. (2011) Genomewide characterization of non-polyadenylated RNAs. Genome Biol. 12(2):R16.

32 Figure 3 Classification of bimorphic transcripts. (a) Gene ontology analysis of bimorphic transcripts according to their functions; see text for details. (b) Overlapping analysis of the expression of bimorphic transcripts in H9 and HeLa cells. (c) An example of a bimorphic histone mRNA, hist1h2afx. Normalized read densities of hist1h2afx from the UCSC genome browser (upper panels). Note that the two isoforms were distinct in poly(A)+ and poly(A)- samples from H9 and HeLa cells. Bottom panels, semi-quantitative RT-PCR with primers that recognize either the longer poly(A)+ transcript or both transcripts confirmed the observations from the deep sequencing. F, forward primer; 1R and 2R, reverse primers. The vertical arrow depicts the position of U7-mediated 3’ end formation. (d) Validation of identified bimorphic transcripts, ccng1 (left panels), nr6a1 (right upper panels) and gprc5a (right bottom panels). Normalized read densities and qRT-PCRs were analyzed as described above. Note that the signals for these transcripts were similar in both poly(A)+ and poly(A)- samples. See text for details. Error bars were calculated from three biological repeats.

33 Figure 1 A large proportion of RNA-seq reads map to intronic regions. (a) The percentage of reads mapping to intronic, exonic and intergenic regions is shown for each RNA-seq dataset (on the y axis). For introns, only reads mapping to the same strand as the surrounding gene are considered. Reads located inside introns but on the opposite strand are labeled intergenic. Introns of length >100 kb (XL introns) are shown by separate bars. (b) Venn diagrams containing the number of identified genes with high intronic RNA score in human fetal and adult brain (left) and human fetal and adult liver (right).
Ameur A, Zaghlool A, Halvardson J, Wetterbom A, Gyllensten U, Cavelier L, Feuk L. (2011) Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nat Struct Mol Biol. 18(12):

34 mRNA special cases Figure 2 Nascent transcription and co-transcriptional splicing. (a) Pattern for AUTS2 (top) and C21orf34, a noncoding RNA gene (bottom), viewed in the University of California, Santa Cruz (UCSC) Genome Browser. The RNA-seq signals have been smoothed using window averaging. For both protein coding genes and long noncoding RNA genes, there is an apparent ‘saw-tooth’ pattern with higher RNA-seq signal toward the 5′ end of each intron. (b) Model for co-transcriptional splicing. The total RNA-seq data give rise to a typical saw-tooth pattern across genes that are actively transcribed. The gradient of RNA across the introns can be explained by a large number of nascent transcripts in various stages of completion. The pattern is repeated for each intron because the nascent transcript is spliced very rapidly after the polymerase completes transcribing each intron. The sequence read coverage is comparatively higher for exons, as the RNA-seq is measuring both the pool of nascent transcripts and the pool of mature polyadenylated RNA.

35 Figure 3 A gradient of RNA levels within introns. (a) The relative RNA levels of the first two exons and the surrounding introns of GRID2 were measured in complementary DNA (cDNA) from human fetal frontal cortex total RNA using quantitative real-time PCR. The qrtPCR results (bottom) correlate with the intronic signals in the RNA-seq data (top). A schematic view of the two first exons of GRID2 and the intermediate intron is seen in the middle panel. The arrows indicate the location of primers (P1 to P11) used in the experiment. The qrtPCR values are based on three independent experiments (error bars are s.d.). (b) RNA-seq signal over all introns in the genome of at least 50 kb in length (L and XL introns), measured in average depth of coverage per million mapped reads (average dcpm). Each intron was divided into 100 bins, and an average value was calculated for each of the individual bins. The four human samples show a decrease of RNA signal across the intron, with the brain samples showing much steeper slopes compared to liver, indicating nascent transcription and co-transcriptional splicing. The dotted lines at the bottom show the RNA-seq signals on the opposite strand.
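
The binning in panel (b) can be sketched as follows. This is an illustration only: the coverage values are synthetic, the function names `bin_profile` and `mean_profile` are hypothetical, each coverage list is assumed to be already oriented 5′ to 3′ and to hold at least as many positions as bins, and the per-million normalization is omitted.

```python
def bin_profile(coverage, n_bins=100):
    """Average per-base coverage within n_bins equal-width bins of one intron."""
    n = len(coverage)
    profile = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        window = coverage[start:end]
        profile.append(sum(window) / len(window))
    return profile


def mean_profile(introns, n_bins=100):
    """Average the binned profiles of several introns, bin by bin."""
    profiles = [bin_profile(cov, n_bins) for cov in introns]
    return [sum(col) / len(col) for col in zip(*profiles)]


# Synthetic intron whose coverage falls linearly from the 5' to the 3' end.
coverage = [1000 - i for i in range(1000)]
profile = bin_profile(coverage, n_bins=10)
combined = mean_profile([coverage, coverage], n_bins=10)
print(profile[0], profile[-1])
```

Because every intron is rescaled to the same number of bins, introns of very different lengths can be averaged together, which is what makes a genome-wide slope visible.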

36 mRNA special cases Figure 1. Models to explain exon scrambling. The canonical linear reference transcript is depicted with exons as colored boxes with four exons 1, 2, 3, and 4. Two simple models of RNA structure that could explain scrambled transcripts are depicted at left and right. At left, model 1 depicts how a scrambled exon 3-exon 2 junction could arise from a tandem duplication of exons 3 and 2, positioning the first copy of exon 3 upstream of exon 2. At the RNA level, this event could arise from post-transcriptional exon rearrangement, or a genomic duplication of exons 2 and 3. Under the model of tandem duplication, when one side of a paired-end read maps to the junction between exon 3 and 2, the other may map to any of exons 1, 2, 3 or 4 with probabilities determined by the library’s insert length distribution and the exon lengths. Our data supports paired-end mapping between a junction and exons 2 or 3, but not exons 1 and 4. We note that in principle, the scrambled exon 3 - exon 2 junction could arise from other splicing events and does not necessarily entail tandem duplication. At right, model 2 depicts how a scrambled exon 3 - exon 2 junction could arise from splicing of exons 2 and 3 into a circular RNA molecule, again positioning exon 3 upstream of exon 2. In this model, when one side of a paired-end read maps to the junction between exon 3 and 2, the other will map to exon 2 or exon 3. doi: /journal.pone g001 Salzman J, Gawad C, Wang PL, Lacayo N, Brown PO. (2012) Circular RNAs Are the Predominant Transcript Isoform from Hundreds of Human Genes in Diverse Cell Types. PLoS One. 7(2):e

37 mRNA special cases Figure 2. Expression levels of scrambled exons. Analysis of paired-end RNA-Seq data from random primed libraries reveals evidence that scrambled exons are present at high stoichiometries compared to the canonical linear transcript transcribed from a large number of human genes. This phenomenon persists across cell types and is illustrated by the expression patterns of 3 leukocyte cell types: CD19 (B cells), CD34 (stem cells) and neutrophils. The abundance of each scrambled transcript is computed as a fraction of total gene expression. The bar plot depicts the number of circular isoforms with estimated abundance relative to all transcripts of the gene in the following ranges: between 0–25%, 25–50%, 50–75% and 75+%. Hundreds of isoforms in each cell type are estimated to represent more than half of all transcripts from each gene. doi: /journal.pone g002
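
The tally behind such a bar plot can be sketched as below. This is a simplified illustration, not the estimation procedure of the cited paper: the isoform names and read counts are invented, the function name `abundance_bin` is hypothetical, and the fraction is taken directly from junction read counts.

```python
from collections import Counter


def abundance_bin(circular, linear):
    """Assign a circular isoform to an abundance range relative to its gene."""
    fraction = circular / (circular + linear)
    if fraction < 0.25:
        return "0-25%"
    if fraction < 0.50:
        return "25-50%"
    if fraction < 0.75:
        return "50-75%"
    return "75+%"


# Hypothetical (circular, linear) junction read counts per isoform.
isoforms = {
    "geneA_3-2": (5, 95),
    "geneB_4-2": (40, 60),
    "geneC_2-2": (60, 40),
    "geneD_5-2": (90, 10),
}

tally = Counter(abundance_bin(c, l) for c, l in isoforms.values())
print(dict(tally))
```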

38 Figure 3. RNaseR assay confirms scrambled exons arise from circular RNA. Panel A: Total RNA from HeLa cells was digested with RNaseR at varying enzyme concentrations (0, 3, 10, and 100 units) after the RNA was depleted of ribosomal RNA. Primers capable of amplifying the canonical linear transcript and the predicted circular transcript (by outward facing primers within a single exon predicted in the scramble) were used in a RT-PCR experiment for each of the digestion conditions. Canonical transcripts were consistently degraded by RNaseR, only detectable by PCR at 0 units of RNaseR, whereas predicted circular transcripts consistently resisted the RNaseR challenge, providing strong evidence of circularity. FBXW4 and MAN1A2 respectively show 2 and 4 circular isoforms, both of which were predicted by the sequencing data. The predicted lengths of circular isoforms are respectively a 3-2 junction of CAMSAP1 (predicted to produce a 435 bp circle), a 4-2 and 5-2 junction of FBXW4 (predicted to produce 415 and 510 bp circles), a 4-2, 5-2 and 6-2 junction of MAN1A2 (predicted to produce 471, 553, and 648 bp circles), a 3-3 junction in REXO4 (predicted to produce a 338 bp circle), a 2-2 junction of RNF220 (predicted to produce a 742 bp circle) and a 3-2 junction of ZKSCAN1 (predicted to produce a 667 bp circle). Panel B: A northern blot on total and cytoplasmic lysate from HeLa cells shows hybridization of a 481 bp probe complementary to the MAN1A2 5-2 exon scramble. 3.7 and 6.2 µg of total and cytoplasmic RNA were loaded onto a 1% agarose gel and 10 pM of probe was hybridized for 24–48 hours. Detection was performed using the BrightStar BioDetect Kit (Ambion, Austin, TX). The specific band at 553 bp corresponds to the predicted size of a circular RNA containing exons 2, 3, 4 and 5 of MAN1A2. doi: /journal.pone g003

39 mRNA special cases Figure 6. Models for generation of circular RNA. At left: a schematic diagram of the canonical splicing process splicing out the first intron of the pre-mRNA of a 4 exon gene, and subsequent removal of introns 2 and 3. Canonical splicing of exon 1 to exon 2 occurs when the splicing machinery catalyzes the formation of the intron lariat and the attack of the free 3’ OH of exon 1 on the 3’ splice site upstream of exon 2. This produces a lariat containing intron 1 and a pre-mRNA with exons 1 and 2 spliced together. At right: a model for the production of circular transcripts. If there is a canonical transcriptional start, and if intron excision does not proceed sequentially in time from the 5’ to 3’ direction of the pre-mRNA, non-canonical pairing of 3’ and 5’ splice sites could be generated. Since the sequences of each 5’ splice site of the pre-mRNA contain the same splicing signals, it is possible that the 3’ splice site upstream of exon 2 is paired with the 5’ splice site downstream of exon 3 and splicing proceeds as if this 5’ splice site were paired with the 3’ splice site upstream of exon 4. In this case, exon 3 would be spliced upstream of exon 2, creating a pre-mRNA intermediate comprised of these two exons and intron 2. Canonical splicing would be predicted to excise this intron, leaving a circular RNA composed of exons 2 and 3. Non-canonical transcription start, as suggested in [25], could produce an orphan 3’ splice site corresponding to the first transcribed exon. This splice site could be paired with a downstream 5’ splice site, generating a circular RNA. In both models, the excised intron would be linear and branched, and expected to be quickly degraded. doi: /journal.pone g006

40 mRNA special cases Bahn JH, Lee JH, Li G, Greer C, Peng G, Xiao X. (2011) Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome Res. 22(1):

41 mRNA special cases

42 Multiple EM for motif elicitation

43 miRNA gene expression
“Small RNA-Seq” is also very important
For a recent review see: Preethi H. Gunaratne, Cristian Coarfa, Benjamin Soibam and Arpit Tandon (2012) miRNA Data Analysis: Next-Gen Sequencing. In: Next-generation MicroRNA expression profiling technology, Fan, J.B. (Ed.) Methods in Molecular Biology, Vol. 822, , DOI: / _19

44 Review
NGS instruments and data analysis software
RNA-Seq overview
mRNA gene expression
mRNA transcript isoforms
mRNA special cases
miRNA gene expression

