6 Example: Paired-end Data Aligned. Some reads are informative about isoform-specific expression.
7 Paired-end RNA-Seq for RNA Isoform-Specific Gene Expression. [Figure labels: Rnpep; Exon 1; Exon 4.] Goal: estimate the expression of each isoform. Nontrivial: we only observe fragments of sequences. Since the size distribution of library molecules is known, inferred insert lengths can be used to increase statistical power and improve inference.
8 Insert Length Distributions. Insert lengths of the entire library (pooled) can be calculated and used to precisely estimate the distribution of sizes of cDNA in the library. [Figure: distribution of sequenced molecule length, in base pairs.]
9 Paired-end RNA-Seq Model. Compute the genome-wide insert length distribution. [Figure: distribution of sequenced molecule length, in base pairs; a read pair mapped to Isoform 1 implies length 150, mapped to Isoform 2 implies length 90.] (Salzman, Jiang, Wong 2011)
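A minimal sketch of how a pooled insert-length distribution can weight ambiguous read pairs, such as a pair implying length 150 on Isoform 1 and length 90 on Isoform 2. This is not the published model of Salzman, Jiang and Wong: it is a plain EM over fragment proportions that ignores isoform length and positional bias, and the names `insert_length_pmf`, `fragment_posterior`, and `em_isoform_proportions` are hypothetical.

```python
from collections import Counter


def insert_length_pmf(observed_lengths):
    """Empirical insert-length distribution pooled over the whole library."""
    counts = Counter(observed_lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def fragment_posterior(frag, theta, pmf, floor=1e-12):
    """Posterior probability that a fragment came from each compatible isoform.

    frag maps isoform index -> insert length implied by mapping to that isoform.
    Lengths never seen in the library get a small floor probability (assumption).
    """
    weights = {k: theta[k] * pmf.get(length, floor) for k, length in frag.items()}
    norm = sum(weights.values())
    if norm == 0.0:
        return {k: 0.0 for k in frag}
    return {k: w / norm for k, w in weights.items()}


def em_isoform_proportions(fragments, pmf, n_isoforms, n_iter=50):
    """Estimate the fraction of fragments from each isoform by EM (simplified)."""
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        for frag in fragments:
            for k, p in fragment_posterior(frag, theta, pmf).items():
                expected[k] += p
        total = sum(expected)
        if total == 0.0:
            break
        theta = [e / total for e in expected]
    return theta


if __name__ == "__main__":
    # Toy library in which 150 bp inserts are four times as common as 90 bp ones.
    pmf = insert_length_pmf([150] * 8 + [90] * 2)
    ambiguous = {0: 150, 1: 90}  # isoform 0 implies 150 bp, isoform 1 implies 90 bp
    print(fragment_posterior(ambiguous, [0.5, 0.5], pmf))
```

Under this toy distribution the ambiguous pair is assigned mostly to the isoform whose implied insert length is more typical of the library, which is the intuition behind using paired-end information for quantification.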
10 Using PE for quantification is statistically more powerful. The PE model is a statistical improvement over naïve models and has optimal information reduction. [Figure: “information” gain using PE sequencing.] Overall, using “mate pair” information gives more power, but sometimes experimental artifacts can affect results.
11 Paired-end Size Distributions are the Foundation for TopHat and other PE RNA-Seq Algorithms. Summary and problems: these algorithms rely on a reference, assume uniformity of size distributions in the library, and overlook biases. [Figure labels: Rep. 1; Rep. 2.]
12 Paired-End RNA-Seq for Gene Fusions in Ovarian Tumors (2009). Paired-end sequencing of poly-A selected RNA from 12 late-stage tumors (genome-wide search). Top hit of our novel algorithm: ESRRA-C11orf20. [Figure labels: C11orf20; ESRRA; Fusion.] Isoform-specific estimation: ESRRA and the fusion are expressed at roughly equal magnitude (Salzman, Jiang, Wong).
14 Recurrent Gene Fusions in Cancer. A handful of recurrent fusions in solid tumors: PAX8-PPARγ fusion (thyroid cancer); EML4-ALK fusion (non-small cell lung cancer); TMPRSS2-ERG family fusion (prostate cancer). Not genome-wide. More to be learned by unbiased study of RNA.
15 Fusion Discovery: 2 flavors. (1) Totally “de novo” discovery: search for any RNA fragments out of order with respect to the reference genome, not necessarily coinciding with exon boundaries. Noisy. (2) Discovery with a reference database: discover fusions at annotated exon boundaries (protein coding), with better statistical checks. Misses some fusions.
16 Reference Approach. Search for gene fusions with exon A in gene 1 spliced to exon B of gene 2. [Figure labels: Exon A; Exon B.]
17 Algorithm (with respect to reference). (1) Remove all PE reads consistent with the reference. (2) Identify gene pairs supported by PE reads where (read1, read2) map to (gene1, gene2). (3) Find PE reads of the form (gene A, gene A-B junction). [Figure labels: Exon A; Exon B.]
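The three steps above can be sketched as a filter over already-aligned read pairs. This is an illustrative sketch only, not the authors' implementation: the `Mate` record, the rule that a pair is "consistent with the reference" when both mates land in the same gene without a cross-gene junction, and the `min_spanning` / `min_junction` thresholds are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Mate(NamedTuple):
    """Hypothetical alignment summary for one mate of a read pair."""
    gene: str                               # gene the mate aligns to
    junction_partner: Optional[str] = None  # other gene, if the mate spans a gene-gene junction


def is_reference_consistent(m1, m2):
    # Simplifying assumption: same gene, and neither mate crosses a gene-gene junction.
    return m1.junction_partner is None and m2.junction_partner is None and m1.gene == m2.gene


def fusion_candidates(pairs, min_spanning=1, min_junction=1):
    # Step 1: remove all PE reads consistent with the reference.
    discordant = [(a, b) for a, b in pairs if not is_reference_consistent(a, b)]

    # Step 2: gene pairs where (read1, read2) map to (gene1, gene2).
    spanning = defaultdict(int)
    for a, b in discordant:
        if a.junction_partner is None and b.junction_partner is None:
            spanning[frozenset((a.gene, b.gene))] += 1

    # Step 3: PE reads of the form (gene A, gene A-B junction).
    junction = defaultdict(int)
    for a, b in discordant:
        for plain, junc in ((a, b), (b, a)):
            if plain.junction_partner is None and junc.junction_partner is not None:
                key = frozenset((junc.gene, junc.junction_partner))
                if plain.gene in key:
                    junction[key] += 1

    # Keep gene pairs supported by both kinds of evidence.
    return {
        tuple(sorted(key)): (n_span, junction.get(key, 0))
        for key, n_span in spanning.items()
        if n_span >= min_spanning and junction.get(key, 0) >= min_junction
    }


if __name__ == "__main__":
    pairs = [
        (Mate("ESRRA"), Mate("ESRRA")),              # consistent with reference, removed
        (Mate("ESRRA"), Mate("C11orf20")),           # spans two genes
        (Mate("ESRRA"), Mate("ESRRA", "C11orf20")),  # one mate on the A-B junction
    ]
    print(fusion_candidates(pairs))
```

Requiring both spanning pairs and junction reads is one way to realize the "better statistical checks" of the reference approach; a real pipeline would also need to handle multi-mapping reads and homologous genes.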
18 Paired-End RNA-Seq for Gene Fusions in Ovarian Tumors. Paired-end sequencing of poly-A selected RNA from 12 late-stage tumors (genome-wide search). Top hit of our algorithm: ESRRA-C11orf20. [Figure labels: C11orf20; ESRRA; Fusion.] Isoform-specific estimation: ESRRA and the fusion are expressed at roughly equal magnitude (Salzman, Jiang, Wong). Salzman et al., 2011.
19 Part 3: Exploratory Analysis of RNA Rearrangements
20 Exploratory analysis: biological “noise” in RNA-Seq Data. [Figure labels: wildtype genome DNA, canonical transcript; locally rearranged DNA, scrambled transcript.] Is exon scrambling present in rRNA-depleted RNA?
21 Bioinformatic Analysis. Thousands of exon scrambling events in RNA from human leukocytes and cancer samples. [Figure labels: wildtype genome DNA; canonical transcript.] Inconsistent with the reference genome!
22 Potential Biological Mechanisms for RNA Rearrangements: DNA rearrangement; RNA rearrangement; trans-splicing; template switching; PCR artifact.
23 Analysis of Leukocyte Data. Exons in ‘scrambled’ (non-increasing) order with respect to canonical exon order. Thousands of genes with evidence of exon scrambling. Naïve estimate of fractional abundance: ratio of scrambled read rate to all read rate (per transcript).
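A small sketch of the naïve estimate described above, assuming each read has already been summarized as the list of canonical exon indices it joins, in read order. That representation and the function names are hypothetical; a read is called scrambled here when some exon is followed by itself or by an earlier exon.

```python
def is_scrambled(exon_order):
    """True if the exons joined by a read are not in strictly increasing canonical order."""
    return any(nxt <= cur for cur, nxt in zip(exon_order, exon_order[1:]))


def scrambled_fraction(reads_by_transcript):
    """Naive per-transcript estimate: scrambled reads divided by all reads."""
    fractions = {}
    for transcript, reads in reads_by_transcript.items():
        if not reads:
            continue  # no reads, no estimate
        n_scrambled = sum(is_scrambled(read) for read in reads)
        fractions[transcript] = n_scrambled / len(reads)
    return fractions


if __name__ == "__main__":
    reads = {
        "toy_transcript": [
            [1, 2],  # canonical junction
            [3, 2],  # exon 3 spliced upstream of exon 2: scrambled
            [2, 3],  # canonical junction
            [4, 1],  # scrambled
        ]
    }
    print(scrambled_fraction(reads))
```

The estimate is naïve in the sense of the slide: it does not correct for differences in mappability or in the number of junctions that can reveal each isoform.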
24 100s of Transcripts with High Fractions of Scrambled Isoforms. [Figure labels: 100s of genes; canonical isoform < 25%; scrambled isoform > 75%.] 100s of transcripts from B cells, stem cells and neutrophils have >50% of copies from the scrambled isoform.
25 What Models Can Explain Exon Scrambling in RNA?
27 Model 1 Prediction. Can be made statistically precise. Model 1 is statistically inconsistent with the vast majority of data. A subset of genes have evidence of tandem duplication in mRNA. [Figure: number of transcripts with evidence against Model 1 vs. for Model 1.]
29 Mining RNA-Seq Data for Evidence Consistent with Circular RNA? In poly-A depleted samples, expect to see strong evidence of scrambled exons (circular RNA). In poly-A selected samples, expect to see little evidence of scrambled exons (circular RNA).
30 Poly-A Depleted Samples Enriched for Scrambled Exons. Align all reads to a custom database.
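The slide does not say how the custom database is built. One plausible construction, shown here purely as an assumption, is to enumerate every ordered exon-exon junction within a gene, including those in scrambled order, so that reads crossing a scrambled junction have something to align to. The function name `junction_database` and the `flank` parameter are hypothetical.

```python
def junction_database(exons, flank):
    """Build junction sequences for all ordered exon pairs of one gene.

    exons: exon sequences in canonical order.
    flank: number of bases taken from each side of the junction (must be >= 1).
    Returns {(donor_index, acceptor_index, kind): sequence}, where kind is
    "canonical" when the acceptor lies downstream of the donor and
    "scrambled" otherwise (including an exon spliced to itself).
    """
    if flank < 1:
        raise ValueError("flank must be at least 1")
    database = {}
    for i, donor in enumerate(exons):
        for j, acceptor in enumerate(exons):
            kind = "canonical" if j > i else "scrambled"
            # End of the donor exon joined to the start of the acceptor exon.
            database[(i, j, kind)] = donor[-flank:] + acceptor[:flank]
    return database


if __name__ == "__main__":
    toy_exons = ["AAAA", "CCCC", "GGGG"]
    for key, seq in sorted(junction_database(toy_exons, flank=2).items()):
        print(key, seq)
```

Reads aligning to a "scrambled" entry but not to the genome or to any "canonical" entry would then be candidate evidence of exon scrambling, to be compared between poly-A depleted and poly-A selected samples.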
31 Summary of RNA-Seq for NGS. RNA-Seq can be used for discovery. TopHat and other fusion/splicing algorithms give a broad picture, but may have significant noise and miss important features of RNA expression.
32 Currently, all published/downloadable algorithms will miss identifying circular RNA! (Feel free to contact me for the algorithm to identify circular RNA!) In poly-A depleted samples, expect to see strong evidence of scrambled exons (circular RNA). In poly-A selected samples, expect to see little evidence of scrambled exons (circular RNA).