Introduction to RNA-seq Joel Parker, Ph.D.. LCCC Biomedical Informatics UNCseq: Cancer genome analysis of 1000+ UNC Hospital patients TCGA: Processed.

1 Introduction to RNA-seq Joel Parker, Ph.D.

2 LCCC Biomedical Informatics
– UNCseq: Cancer genome analysis of 1000+ UNC Hospital patients
– TCGA: Processed and distributed 8K+ cancer transcriptomes (>1 PB)
– Cancer Survivorship Cohort: Recruit, track, and follow up 4K+ patients
– Genome and transcriptome analytics in multiple clinical trials
– Management and analytics supporting 10+ clinical programs and 20+ faculty labs
– Contributing authors to 150+ manuscripts

3 Introduction to RNA-seq
– Advantages and challenges of RNA-seq
– Experimental design
– Raw data
– Mapping
– Quantification
– Differential expression

4 Why mRNAseq?
There are at least four compelling reasons for choosing mRNA-seq instead of microarray-based technologies
– Specificity of what is being measured
– Reduced technical (batch) bias
– Increased dynamic range and log ratio (FC) estimates
– More sensitive detection of genes, transcripts, and differential expression
Other reasons
– Detection of expressed SNVs
– Detection of fusions and other structural variations
– No transcriptome definition is needed
– No probes need to be designed or manufactured
– Cost (will soon be equivalent on a per-assay basis with microarray)

6 Why mRNAseq? – Reduced Bias: Cell types separate biologically [Figure: samples grouped by cell type: CD19, CD8, CD14, CD4]

7 Why mRNAseq? – Reduced Processing Bias: Client’s miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times over several months, with no apparent bias in the top principal components [Figure: samples labeled by machine: GAIIx, HS-01, HS-02, HS-IL]

10 Library preparation. Others: Stranded; Exome or target enrichment; Blood, MT, etc.

13 Sequencing parameters: Read Length (Trapnell et al., Nature Biotechnology 31, 46–53 (2013))

14 Detection is Dependent on Depth
Detection in this case is defined as at least 10 fragments per million (FPM) assigned to the gene or isoform in at least 20% of samples. As the number of clusters increases, so does the number of detected genes (left) or isoforms (right), but the gain is small beyond 10M: 50M is five times the depth of 10M yet yields only about a 15% increase in detection level. [Figure panels: Genes (left), Isoforms (right)]
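The detection rule on this slide (at least 10 FPM in at least 20% of samples) can be sketched in a few lines. This is a minimal illustration in plain Python, not the code behind the figure; the function names and the dict-per-sample input layout are assumptions made for the example.

```python
def fragments_per_million(counts):
    """Scale one sample's raw fragment counts so they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: 1e6 * n / total for feature, n in counts.items()}


def detected_features(samples, min_fpm=10.0, min_fraction=0.20):
    """Return features with FPM >= min_fpm in at least min_fraction of samples.

    `samples` is a list of dicts mapping feature id -> raw fragment count
    (hypothetical layout chosen for this sketch).
    """
    fpm_per_sample = [fragments_per_million(s) for s in samples]
    features = set().union(*(s.keys() for s in samples))
    needed = min_fraction * len(samples)
    detected = set()
    for feature in features:
        hits = sum(1 for fpm in fpm_per_sample if fpm.get(feature, 0.0) >= min_fpm)
        if hits >= needed:
            detected.add(feature)
    return detected


if __name__ == "__main__":
    # Five toy samples of one million fragments each.
    toy = [{"A": 999_980, "B": 20, "C": 0}] + [{"A": 999_995, "B": 0, "C": 5}] * 4
    print(sorted(detected_features(toy)))
```

Tightening `min_fraction` or `min_fpm` shrinks the detected set, which is why the reported detection level depends on the threshold as well as on depth.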

15 Liu et al., Bioinformatics (2014) 30 (3): 301-304.

16 Computational Processing
Technical variation (batch effects) from library preparation and sequencing is small, and the sequencing strategy, especially depth, directs the level of repeatability and detection.
The raw results of sequencing require significant computational processing
– Alignment: Maximizing unambiguous alignments; alignment of reads that cross exon junctions; Ex: Bowtie, BWA, TopHat
– Abundance estimation: Gene or transcript; handling alignments that are ambiguous in the transcriptome; Ex: Sailfish, RSEM, Cufflinks, MISO, IsoEM, IsoInfer, Rseq, ...
– Normalization of read counts: Minimizing bias due to variation in number of clusters available; Ex: Total count (RPM), upper quartile, quantile, density
Different algorithmic and computational strategies, especially the transcriptome definition, impact performance much more than SE vs. PE or 50 bp vs. 100 bp.

18 Alignment
– Spliced alignment to a reference: TopHat, MapSplice, STAR
– De novo transcriptome assembly: Trinity, Trans-ABySS

19 (1) Read r may be incorrectly mapped to the intron between exons e1 and e2 (non-gapped alignment); the correct mapping is a spliced alignment across the GT...AG intron boundaries. (2) A read that spans a splice junction (shown in red) can be aligned end-to-end to a processed pseudogene. [Figure: reads, exons e1–e3, and introns, shown for the gene and for the processed pseudogene]

21
1) Transcriptome mapping, which is used only when annotation is provided
2) Genome mapping
3) Split-read alignment of reads left unmapped by step 2: novel splice sites are differentiated from indels and fusions using known junction signals (GT-AG, GC-AG, and AT-AC) supported by ‘islands’ and spliced alignments
4) Remapping of unaligned and previously poor mappings
5) Statistical assessment to assign the most likely alignment of multi-mappers
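The junction-signal check in step 3 can be illustrated with a small sketch. It tests whether a candidate intron starts and ends with one of the three signals named on the slide (GT-AG, GC-AG, AT-AC), on either strand. The function names and the 0-based, end-exclusive coordinates are assumptions for this example; real aligners combine this check with the other evidence listed above.

```python
KNOWN_SIGNALS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-complement a DNA string (upper-case A/C/G/T/N)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def junction_signal(intron):
    """Return the (donor, acceptor) dinucleotides at the ends of an intron sequence."""
    intron = intron.upper()
    return intron[:2], intron[-2:]


def has_known_signal(genome, intron_start, intron_end):
    """True if genome[intron_start:intron_end] carries a known signal on either strand."""
    intron = genome[intron_start:intron_end]
    if len(intron) < 4:
        return False
    return (junction_signal(intron) in KNOWN_SIGNALS
            or junction_signal(reverse_complement(intron)) in KNOWN_SIGNALS)


if __name__ == "__main__":
    genome = "AAACCC" + "GTAAGTTTTTCAG" + "GGGTTT"
    print(has_known_signal(genome, 6, 19))
```

A gap in a read's alignment that lacks any of these signals is more likely to reflect an indel, a fusion, or an alignment error than a splice site, which is the distinction step 3 draws.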

24 Example Concordant [IGV view; tracks: Gene, V2, V1] http://www.broadinstitute.org/igv/

25 Example Discordant 1 [IGV view; tracks: Gene, V2, V1]

26 Example Discordant 2 [IGV view; tracks: Gene, V2, V1]

30 Alignment Comparison (Engström et al., Nature Methods 10, 1185–1191 (2013))

31 Alignment Comparison: Splice Junction Accuracy (Engström et al., Nature Methods 10, 1185–1191 (2013))

32 Many RNAseq Aligners Shown is the percentage of sequenced or simulated read pairs (fragments) mapped by each protocol. Protocols are grouped by the underlying alignment program (gray shading). Protocol names contain the suffix “ann” if annotation was used. The suffix “cons” distinguishes more conservative protocols from others based on the same aligner. The K562 data set comprises six samples, and the metrics presented here were averaged over them. Systematic evaluation of spliced alignment programs for RNA-seq data. Engström et al, Nature Methods 2013

34 Multireads: Reads Mapping to Multiple Genes/Transcripts [Figure: three genes (1 Long, 2 Medium, 3 Short) covered by unique reads and multireads; relative abundances f1, f2, f3]

35 Approach 1: Ignore Multireads (Nagalakshmi et al., Science 2008; Marioni et al., Genome Research 2008) [Figure: same three-gene example]

36 Approach 1: Ignore Multireads
– Over-estimates the abundance of genes with unique reads
– Under-estimates the abundance of genes with multireads
– Not an option at all, if interested in isoform expression

37 Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques (Ali Mortazavi et al., Nature Methods 2008; tools: Sailfish, RSEM, Cufflinks) [Figure: same three-gene example]
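The two approaches can be contrasted with a toy sketch. Approach 1 keeps only the unique counts; Approach 2 splits each multiread among its candidate genes in proportion to an abundance estimated from the unique reads. Here that estimate is unique reads per base, which is one simple reading of the slide and an assumption of this example; the cited tools use more elaborate statistical models. All names and numbers are hypothetical.

```python
def ignore_multireads(unique_counts):
    """Approach 1: report unique-read counts only."""
    return {gene: float(n) for gene, n in unique_counts.items()}


def allocate_multireads(unique_counts, lengths, multireads):
    """Approach 2: split each multiread by unique-read density (reads per base).

    `multireads` is a list of tuples, each holding the gene ids one read maps to.
    """
    density = {gene: unique_counts[gene] / lengths[gene] for gene in unique_counts}
    totals = {gene: float(n) for gene, n in unique_counts.items()}
    for genes in multireads:
        weight_sum = sum(density[g] for g in genes)
        for g in genes:
            # Fall back to an even split when no candidate has unique reads.
            share = density[g] / weight_sum if weight_sum > 0 else 1.0 / len(genes)
            totals[g] += share
    return totals


if __name__ == "__main__":
    unique = {"A": 300, "B": 100}
    lengths = {"A": 1000, "B": 1000}
    shared = [("A", "B")] * 40
    print(ignore_multireads(unique))
    print(allocate_multireads(unique, lengths, shared))
```

In this toy case the 40 shared reads are split 3:1, following the unique-read evidence, instead of being discarded.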

38 Multireads: Reads Mapping to Multiple Genes/Transcripts. Wang X, Wu Z, Zhang X. Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq. J Bioinform Comput Biol. 2010 Dec;8 Suppl 1:177-92. PubMed PMID: 21155027.

39 Cufflinks PMID: 20436464

40 RSEM (Li and Dewey, 2011; PMID: 21816040). Panels: A) PE isoform; B) PE gene; C) SE isoform; D) SE gene. θ_i represents the probability that a fragment is derived from transcript i.
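How an EM procedure can estimate θ_i from ambiguous fragments is easiest to see in a stripped-down form. The sketch below is not RSEM: it ignores transcript length, fragment position, orientation, and base qualities, all of which the real model accounts for. It only shows the core loop: assign each fragment fractionally according to the current θ (E-step), then re-estimate θ from those fractional assignments (M-step). The input layout and function name are assumptions for this example.

```python
def em_theta(compatibility, n_transcripts, n_iter=200):
    """Estimate theta[i] = P(a fragment derives from transcript i).

    `compatibility` is a list with one entry per fragment: a tuple of the
    transcript indices that fragment is compatible with.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each fragment among its compatible transcripts.
        for transcripts in compatibility:
            denom = sum(theta[t] for t in transcripts)
            if denom == 0.0:
                continue
            for t in transcripts:
                expected[t] += theta[t] / denom
        # M-step: renormalize expected counts into probabilities.
        total = sum(expected)
        if total == 0.0:
            break
        theta = [e / total for e in expected]
    return theta


if __name__ == "__main__":
    # 30 fragments unique to transcript 0, 10 unique to transcript 1, 20 shared.
    fragments = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 20
    print(em_theta(fragments, 2))
```

For this toy input the estimates settle at θ = (0.75, 0.25): the shared fragments end up divided in the same 3:1 ratio as the unique evidence.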

41 Current Methods
– RSEM: demonstrated high accuracy as part of the MAQC experiment, but run time scales exponentially with read count and transcript definitions
– eXpress: EM similar to RSEM, but processes reads in a streaming fashion
– Sailfish: EM similar to RSEM, but replaces approximate alignment of reads with exact alignment of k-mers
– Kallisto: similar to Sailfish, but with further reduced alignment complexity (very fast!)
Initial evidence was that eXpress, Sailfish, and Kallisto showed a drop in accuracy relative to RSEM. Then...
– Salmon: combines alignment-informed k-mer matching with the Kallisto variational Bayes approach and a ‘local’ EM approach
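The idea of replacing read alignment with exact k-mer matching can be shown with a toy index. The sketch below builds a hash from every k-mer to the transcripts that contain it, then intersects the hits across a read's k-mers to find the transcripts compatible with the whole read. It illustrates the general principle only; it does not reproduce the data structures or algorithms of Sailfish, Kallisto, or Salmon, and all names and sequences are made up for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Transcripts that contain every k-mer of the read (empty set if none)."""
    result = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        result = set(hits) if result is None else result & hits
        if not result:
            return set()
    return result or set()


if __name__ == "__main__":
    toy = {"t1": "ACGTACGTTAGC", "t2": "ACGTACGAAAGC"}
    idx = build_kmer_index(toy, 5)
    print(compatible_transcripts("ACGTACG", idx, 5))
    print(compatible_transcripts("TACGTTAG", idx, 5))
```

Each lookup is a hash query, so no base-by-base alignment is computed; the resulting compatibility sets are the kind of input an EM-style abundance estimate consumes.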

42 Salmon Novelties
– Streaming variational Bayes (VB) inference combined with batched VB or EM
– Lightweight alignment through maximal exact matches
– Transcript / gene abundance inference is abstracted from the alignment step [RSEM also permits this; sam-xlate in https://github.com/mozack/ubu/wiki]

47 Normalization

48 Typical measures of expression
– FPM: fragments per million
– Statistical tools like DESeq and SAMseq do not utilize these transformations
– Quartile normalization (preferred) uses arbitrary units
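The difference between total-count scaling (FPM) and upper-quartile normalization can be sketched as follows. The upper-quartile version divides each count by the 75th percentile of that sample's nonzero counts and rescales to a fixed target; the target of 1000 is an arbitrary choice for this example, which is why the resulting units are arbitrary. Function names are assumptions made for the sketch.

```python
import statistics


def fpm(counts):
    """Total-count scaling: fragments per million."""
    total = sum(counts.values())
    return {g: (1e6 * n / total if total else 0.0) for g, n in counts.items()}


def upper_quartile(values):
    """75th percentile of the nonzero values."""
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        raise ValueError("no nonzero counts to normalize against")
    if len(nonzero) == 1:
        return float(nonzero[0])
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(counts, target=1000.0):
    """Rescale counts so the sample's upper quartile equals `target`."""
    uq = upper_quartile(counts.values())
    return {g: target * n / uq for g, n in counts.items()}


if __name__ == "__main__":
    sample = {"g1": 10, "g2": 20, "g3": 30, "g4": 40, "g5": 50, "g6": 0}
    deeper = {g: 2 * n for g, n in sample.items()}  # same library at twice the depth
    print(upper_quartile_normalize(sample))
    print(upper_quartile_normalize(deeper))
```

Both samples normalize to the same values, so depth differences are removed. Unlike the total count, the upper quartile is little affected by a handful of very highly expressed genes.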

49 Repeatability & Detection by Isoform Database Larger reference transcriptomes result in reduced repeatability (left), but increased detection (right) Detection - 73% of RefSeq, 66% of UCSC, and 52% of Ensembl

50 Differential Expression: Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010 Feb 18;11:94. doi: 10.1186/1471-2105-11-94. PubMed PMID: 20167110; PubMed Central PMCID: PMC2838869

52 Summary
Consideration of what is needed versus wanted
– Relative expression of genes or isoforms
– SNPs / indels
– Transcript discovery
– Fusion detection
Determines optimal
– library prep
– seq length
– seq quantity
– SE or PE
– number of replicates
– processing protocol

54 Fusion Genes
– High expression of ALK kinase domain
– ~Zero expression of upstream ALK
[Figure annotation: gene orientation]
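The expression pattern described here (high coverage over the retained 3' part of a gene, near-zero coverage upstream) can be turned into a simple screening score. The sketch below computes a log2 ratio of mean exon coverage downstream versus upstream of a candidate breakpoint. The function name, the pseudocount, and the toy 29-exon coverage vector are all hypothetical; this illustrates the idea and is not a method described in the slides.

```python
import math


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def exon_imbalance(exon_coverage, first_retained_exon, pseudocount=1.0):
    """log2 ratio of mean coverage at/after a breakpoint vs. before it.

    `exon_coverage` lists mean coverage per exon in 5'->3' order;
    `first_retained_exon` is the 0-based index of the first exon kept in the fusion.
    """
    upstream = exon_coverage[:first_retained_exon]
    downstream = exon_coverage[first_retained_exon:]
    return math.log2((mean(downstream) + pseudocount) / (mean(upstream) + pseudocount))


if __name__ == "__main__":
    # Toy gene: 19 silent upstream exons, then 10 well-covered exons.
    coverage = [0.0] * 19 + [127.0] * 10
    print(exon_imbalance(coverage, 19))
```

A large positive score flags a gene whose 3' end is expressed without its 5' end; confirming an actual fusion still requires read-level evidence.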

55 Fusions [Figure: Gene 1 and Gene 2 joined as a Gene1-Gene2 fusion, with spanning reads and bridging reads marked]

56 Spanning Alignment View: EML4 exon 13, ALK exon 20; fusion spanning reads

57 Fusion Genes TCGA, LUAD, submitted

58 Virus Detection: HNSCC, HPV16 [Figure: samples #1–#4 across HPV16 genes E6, E7, E1, E2, E4, E5, L2, L1]

59 Detecting tumor mutations by integration of DNA and RNA sequencing via UNCeqR (Matthew Wilkerson, UNC Chapel Hill)
– Weighted combination of DNA whole-exome sequencing and RNA sequencing
– RNAseq can help increase somatic mutation detection, particularly in low-purity tumors
http://lbg.med.unc.edu/~mwilkers/unceqr

60 UNCeqR: PIK3CA H1047R, TCGA-AR-A252, Luminal A subtype. http://lbg.med.unc.edu/~mwilkers/unceqr

61 RNA often provides a greater mutation signal than DNA

62 RNA integration boosts performance in tumors that have low purity

63 Novel mutations relative to published profiles

