RNA-seq: From experimental design to gene expression. Steve The Jackson Laboratory UMaine Computational Methods in Biology/ Genomics.


1 RNA-seq: From experimental design to gene expression. Steve The Jackson Laboratory UMaine Computational Methods in Biology/ Genomics September 29, 2014

2 Outline: General overview of RNA-seq analysis. Introduction to RNA-seq; the importance of a good experimental design; quality control; read alignment; quantifying isoform and gene expression; normalization of expression estimates.

3 Understanding gene expression: ON/OFF. Alwine et al., PNAS 1977; DeRisi et al., Science 1997.

4 Next Generation Genome Sequencers: Illumina HiSeq and MiSeq, 454 GS FLX, Oxford Nanopore, PacBio SMRT N

5 RNA-Seq: Sequencing Transcriptomes [Figure: RNA and the short sequence reads derived from it] N

6 Applications of RNA-Seq Technology: Gene expression analysis; Novel exon discovery; RNA editing (A/G); Allele-specific gene expression [Figure: short sequence reads illustrating each application] N

7 RNA-Seq: Total RNA -> mRNA -> mRNA after fragmentation -> cDNA -> Adaptors ligated to cDNA -> Single/Paired End Sequencing N

8 Challenges in RNA-Seq Work Flow: Study Design -> RNA isolation/Library Prep -> Sequencing Reads (SE or PE) -> Aligned Reads -> Quantified isoform and gene expression N

9 Know your application – Design your experiment accordingly. Differential expression of highly expressed and well annotated genes? – Lower sequencing depth per sample; more biological replicates. – No need for paired-end reads; shorter reads (50bp) may be sufficient. – Better to have 20 million 50bp reads than 10 million 100bp reads. Looking for novel genes/splicing/isoforms? – More read depth; paired-end reads from longer fragments. Allele-specific expression? – “Good” genomes for both strains. N

10 Good Experimental Design: Multiplexing, Replication, Randomization. One Illumina HiSeq flowcell = 8 lanes. Multiplex up to 24 samples in a lane. 187 million SE reads per lane; 374 million PE reads per lane. Cost per lane: 100 bases PE: $ $ bases PE: $ $ bases SE: $800 - $1200 N
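The multiplexing figures above imply a simple depth budget per sample. The sketch below only divides the quoted per-lane yields evenly among the samples in a lane. The even split and the sample counts in the loop are illustrative assumptions, not values from the slides.

```python
# Per-lane yields quoted on the slide.
SE_READS_PER_LANE = 187_000_000
PE_READS_PER_LANE = 374_000_000


def reads_per_sample(reads_per_lane: int, samples_in_lane: int) -> float:
    """Average reads per sample, assuming the lane is shared evenly."""
    if samples_in_lane < 1:
        raise ValueError("need at least one sample per lane")
    return reads_per_lane / samples_in_lane


# Illustrative multiplexing levels (assumed, up to the 24-sample maximum).
for n in (1, 6, 12, 24):
    depth = reads_per_sample(SE_READS_PER_LANE, n)
    print(f"{n:2d} samples/lane: {depth / 1e6:.1f} million SE reads per sample")
```

Real lanes rarely split perfectly evenly, so treat these numbers as an upper-level planning estimate.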

11 RNA-Seq Experimental Design: Randomization. Two Illumina lanes; Experimental Group 1 and Experimental Group 2. [Figure: Bad Design] Random.org N

12 RNA-Seq Experimental Design: Randomization. Two Illumina lanes; Experimental Group 1 and Experimental Group 2. [Figure: Bad Design, Good Design] Random.org N

13 RNA-Seq Experimental Design: Randomization. Two Illumina lanes; Experimental Group 1 and Experimental Group 2. [Figure: Bad Design, Good Design, Better Design] Random.org N
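The design figures on these slides did not survive extraction, so the sketch below is not necessarily the layout they showed. It illustrates one common way to avoid confounding group with lane: shuffle the samples within each experimental group and deal them across lanes so every lane holds a balanced mix. The function name, sample labels, and seed are hypothetical.

```python
import random


def assign_lanes(groups, n_lanes=2, seed=None):
    """Blocked randomization: shuffle within each group, then deal round-robin to lanes."""
    rng = random.Random(seed)
    lanes = [[] for _ in range(n_lanes)]
    for group, samples in groups.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        for i, sample in enumerate(shuffled):
            lanes[i % n_lanes].append((group, sample))
    return lanes


# Hypothetical sample labels for two experimental groups.
groups = {
    "group1": ["g1_a", "g1_b", "g1_c", "g1_d"],
    "group2": ["g2_a", "g2_b", "g2_c", "g2_d"],
}
lanes = assign_lanes(groups, n_lanes=2, seed=2014)
for number, lane in enumerate(lanes, start=1):
    print(f"lane {number}: {lane}")
```

With this layout a lane effect cannot masquerade as a group effect, because each lane contains the same number of samples from each group.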

14 Challenges in RNA-Seq Work Flow: Study Design -> RNA isolation/Library Prep -> Sequencing Reads (SE or PE) -> Aligned Reads -> Quantified isoform and gene expression N

15 Library prep steps and sources of bias: Total RNA -> mRNA (poly-A tail selection) -> mRNA after fragmentation (not actually random) -> cDNA (“not so” random primers) -> Adaptors ligated to cDNA (adapter dimers) -> Size selection step (gel extraction) -> PCR amplification S

16 Read: Sanger fastq. Header fields: instrument: run/flowcell id; flowcell lane and tile number; X-Y coordinate in flowcell; the member of a pair; index. Sequence: TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG + Quality: CCCFFFFFHHHHDHHJJJJJJJJIJJ?FGIIIJJJJJJIJJJJJJFHIJJJIJHHHFFFFD>AC?B??C?ACCAC>BB Phred Score: $Q = -10 \log_{10} P$. 10 indicates a 1 in 10 chance of error, 20 indicates 1 in 100, 30 indicates 1 in 1000. SN
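The Phred relationship on this slide can be checked directly. The sketch below decodes a Sanger-style quality string (ASCII offset 33) into Phred scores and converts them to error probabilities. The function names are illustrative, and the quality string is just the first few characters of the example read.

```python
def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = 33) -> list:
    """Convert a Sanger fastq quality string to Phred scores (offset 33)."""
    return [ord(ch) - offset for ch in qual]


quality = "CCCFFFFFHHHHDHHJJJJ"  # prefix of the quality line shown on the slide
scores = decode_quality(quality)
for ch, q in zip(quality[:5], scores[:5]):
    print(ch, q, f"{phred_to_error_prob(q):.5f}")
```

A mean over `scores` gives the per-read quality that the later per-sequence quality plots summarize.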

17 QUALITY CONTROL To Trim or not to Trim? S

18 Quality Control: Sequence quality per base position. Good data: consistent; high quality along the reads. Bad data: high variance; quality decreases with length. S

19 NGS Data Preprocessing: Per sequence quality distribution. [Figure: X = mean sequence quality, Y = number of reads; panels: bad data, average data] S

20 NGS Data Preprocessing: Per sequence quality distribution. [Figure: X = mean sequence quality, Y = number of reads; panels: bad data, average data, good data] S

21 Quality Control: Sequence Content Across Bases S

22 NGS Data Preprocessing: K-mer content. Counts the enrichment of every 5-mer within the sequence library. Bad: k-mer enrichment >= 10-fold at any individual base position.

23 K-mer content Most samples

24 NGS Data Preprocessing: Duplicated sequences. Good: non-unique sequences make up less than 20% of reads. Bad: non-unique sequences make up more than 50% of reads. S
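The duplication thresholds above can be illustrated with a small calculation. The sketch below takes "non-unique" to mean the fraction of reads whose exact sequence occurs more than once. That definition is an assumption for illustration and is not how FastQC itself computes its duplication metric. The function names and the toy reads are hypothetical.

```python
from collections import Counter


def non_unique_fraction(reads):
    """Fraction of reads whose exact sequence appears more than once."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(reads)


def classify_duplication(fraction):
    """Apply the slide's thresholds: <20% good, >50% bad, otherwise in between."""
    if fraction < 0.20:
        return "good"
    if fraction > 0.50:
        return "bad"
    return "intermediate"


toy_reads = ["AAA", "AAA", "CCC", "GGG", "TTT"]
fraction = non_unique_fraction(toy_reads)
print(fraction, classify_duplication(fraction))
```

As the albumin slide that follows points out, a high value here may reflect a very highly expressed gene rather than PCR duplicates, so the label is a prompt to look closer, not a verdict.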

25 PCR duplicates or highly expressed genes? The case of Albumin. ~80,000x coverage here S

26 Tradeoffs to preprocessing data. Signal/noise -> Preprocessing can remove low-quality “noise”, but the cost is information loss. – Some uniformly low-quality reads map uniquely to the genome. – Trimming reads to remove lower quality ends can adversely affect alignment, especially if aligning to the genome and the read spans a splice site. – Duplicated reads or just highly expressed genes? – Most aligners can take quality scores into consideration. – Currently, we do not recommend preprocessing reads aside from removing uniformly low quality samples. S

27 The problem with trimming all SE reads: longer is better for splice junction spanning reads. [Figure: 100bp reads vs. all reads trimmed to 75bp] S

28 NGS Data Preprocessing: Quality Control Resources. FASTX-Toolkit; FastQC S

29 RNA-Seq Work Flow: Study Design -> Total RNA -> Sequencing Reads (SE or PE) -> Aligned Reads -> Quantified isoform and gene expression KB

30 Alignment 101. [Figure: 100bp read ACATGCTGCGGA; candidate locations on Chr 1, Chr 2, Chr 3] KB

31 The perfect read: 1 read = 1 unique alignment. [Figure: 100bp read ACATGCTGCGGA with one correct alignment (✓) among Chr 1, Chr 2, Chr 3] KB

32 Some reads will align equally well to multiple locations: “multireads”. 1 read, 3 valid alignments; only 1 alignment is correct. [Figure: 100bp read ACATGCTGCGGA; alignments marked ✓ ✗ ✗] KB

33 The worst case scenario: 1 read, 2 valid alignments; neither is correct. [Figure: 100bp read with sequences ACATGCTGCGGA, ACATGCTGCGGC, ACATGCTGCGGA; alignments marked ✗ ✗] KB

34 Individual genetic variation may affect read alignment. [Figure: 100bp read with sequences ACATGCTGCGGA, ACATGCTGCGGC, ACATGCTGCGGA, ATATGCTGCGGA; alignments marked ✗ ✗] KB
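The distinction between unique reads and multireads can be shown with a toy exact-match search. This is only a sketch: real aligners use indexed data structures and tolerate mismatches, and the reference sequences below are invented for illustration.

```python
def find_alignments(read, reference):
    """Return every (chromosome, position) where the read matches exactly."""
    hits = []
    for chrom, seq in reference.items():
        start = seq.find(read)
        while start != -1:
            hits.append((chrom, start))
            start = seq.find(read, start + 1)
    return hits


def classify_read(hits):
    """Label a read by how many exact alignments it has."""
    if not hits:
        return "unaligned"
    return "unique" if len(hits) == 1 else "multiread"


# Invented toy reference; chr1 and chr2 share the slide's example sequence.
reference = {
    "chr1": "TTACATGCTGCGGATT",
    "chr2": "GGACATGCTGCGGACC",
    "chr3": "CCCCCCCC",
}
for read in ("ACATGCTGCGGA", "TTAC", "GGGG"):
    hits = find_alignments(read, reference)
    print(read, classify_read(hits), hits)
```

Exact matching also shows why genetic variation matters: a read carrying a variant, such as ATATGCTGCGGA on the slide, would find no hit at its true origin under this scheme.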

35 Aligning Billions of Short Sequence Reads. Aligners: Bowtie, GSNAP, BWA, MAQ, BLAT. Designed to align short reads fast, but not necessarily accurately. [Figure: Gene A, Gene B] KB

36 Aligning Sequence Reads: gene family (orthologous/paralogous), low-complexity sequence, alternatively spliced isoforms, pseudogenes, polymorphisms, indels, structural variants, reference sequence errors. [Figure: Gene A, Gene B] KB

37 Align to Genome or Transcriptome? Genome Transcriptome KB

38 Aligning to Reference Genome: Exon First Alignment [Figure: Exon 1, Exon 2, Exon 3] KB

39 Aligning to Reference Genome: Exon First Alignment [Figure: Exon 1, Exon 2, Exon 3] KB

40 Aligning to Reference Genome: Exon First Alignment [Figure: Exon 1, Exon 2, Exon 3] KB

41 Exon First Alignment: Pseudo-gene problem [Figure: gene with Exon 1, Exon 2, Exon 3; processed pseudo-gene with Exon 1, Exon 2, Exon 3] KB

42 Exon First Alignment: Pseudo-gene problem [Figure: gene with Exon 1, Exon 2, Exon 3; processed pseudo-gene with Exon 1, Exon 2, Exon 3] KB

43 Exon First Alignment: Pseudo-gene problem [Figure: gene with Exon 1, Exon 2, Exon 3; processed pseudo-gene with Exon 1, Exon 2, Exon 3] KB

44 Align to Genome or Transcriptome? Genome: Advantages: can align novel isoforms. Disadvantages: difficult; spurious alignments, spliced alignment, gene families, pseudo-genes. KB

45 Reference Transcriptome [Figure: Gene 1 with Exons 1-3 and Isoforms 1-3; Gene 2 with Exons 1-4 and Isoforms 1-3] KB

46 Align to Genome or Transcriptome? Transcriptome: Advantages: easy; focused on the part of the genome that is known to be transcribed. Disadvantages: reads that come from novel isoforms may not align at all or may be misattributed to a known isoform. Genome: Advantages: can align novel isoforms. Disadvantages: difficult; spurious alignments, spliced alignment, gene families, pseudo-genes. KB

47 Better Approach: Aligning to Transcriptome and Genome. Align to the transcriptome first, then align the remaining reads to the genome. Advantages: relatively simpler; overcomes the pseudo-gene and novel isoform problems. Genome alignment alone: Advantages: can align novel isoforms. Disadvantages: difficult; spurious alignments, spliced alignment, gene families, pseudo-genes. Tools: RUM, TopHat2, STAR. KB

48 Conclusions There is no perfect aligner. Pick one well-suited to your application. – E.g. Want to identify novel exons? Don’t align only to the known set of isoforms. Visually inspect the resulting alignments. Setting a parameter a little too liberal or conservative can have a huge effect on alignment. Consider running the same fastq files through multiple alignment pipelines specific to each application. – Gene expression -> Bowtie to transcriptome – Exon discovery -> RUM or other hybrid mapper – Variant detection -> GSNAP, GATK, Samtools mpileup If your species has not been sequenced, use a de novo assembly method. Can also use the genome of a related species as a scaffold. KB

49 Output of most aligners: Bam/Sam file of reads and genome positions S

50 Visualization of alignment data (BAM/SAM) Genome browsers – UCSC, IGV, etc.

51 IGV is your friend. [Figure annotations: read color = strand; SNP; coverage density plot]

52 Aligned Reads to Gene Abundance: Total RNA -> 100bp Reads -> Aligned Reads -> Quantified isoform and gene expression N

53 Aligned Reads to Gene Abundance: Challenges. Many approaches to quantify expression abundance. [Figure: Long and Short genes] N

54 Aligned Reads to Gene Abundance: Challenges. [Figure: Long, Medium, and Short genes; 200 reads] Relative abundance for these genes: f1, f2, f3 N

55 Aligned Reads to Gene Abundance: Challenges. [Figure: Long, Medium, and Short genes; 200] Relative abundance for these genes: f1, f2, f N

56 Aligned Reads to Gene Abundance: Challenges. [Figure: Long, Medium, and Short genes; 200] Relative abundance: f1 = 0.2, f2 = 0.4, f3 = N

57 Multireads: Reads Mapping to Multiple Genes/Transcripts [Figure: Long and Medium genes] N

58 Gene multireads: reads that map to multiple genomic locations. [Figure: Gene A, Gene B, Gene C] N

59 Isoform multireads [Figure: Exon 1, Exon 2, Exon 3; Isoform 1, Isoform 2, Isoform 3] N

60 Multireads: Reads Mapping to Multiple Genes/Transcripts. [Figure: Long, Medium, and Short genes; 200; unique reads and multireads] Relative abundance for these genes: f1, f2, f N

61 Approach 1: Ignore Multireads. [Figure: Long, Medium, and Short genes; 200] Relative abundance for these genes: f1, f2, f. Nagalakshmi et al., Science 2008; Marioni et al., Genome Research 2008 N

62 Approach 1: Ignore Multireads. Over-estimates the abundance of genes with unique reads. Under-estimates the abundance of genes with multireads. Not an option at all if interested in isoform expression. [Figure: Long, Medium, and Short genes; 200] N

63 Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques. [Figure: Long, Medium, and Short genes; 200] Relative abundance for these genes: f1, f2, f. Mortazavi et al., Nature Methods 2008. Tools: RSEM, Cufflinks N
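Approach 2 can be sketched as a proportional "rescue" of multireads. In this minimal version each multiread is split among the genes it maps to in proportion to their unique read counts. The function name and toy counts are hypothetical, and the sketch uses raw unique counts without the length adjustment or iteration that published methods apply.

```python
def allocate_multireads(unique_counts, multireads):
    """Split each multiread among its candidate genes, weighted by unique counts.

    unique_counts: dict of gene -> number of uniquely aligned reads
    multireads: list of tuples, each naming the genes one multiread maps to
    """
    totals = {gene: float(count) for gene, count in unique_counts.items()}
    for genes in multireads:
        weights = [unique_counts.get(gene, 0) for gene in genes]
        weight_sum = sum(weights)
        for gene, weight in zip(genes, weights):
            # Fall back to an even split when no candidate has unique reads.
            share = weight / weight_sum if weight_sum > 0 else 1 / len(genes)
            totals[gene] = totals.get(gene, 0.0) + share
    return totals


# Toy example: four reads map equally well to genes A and B.
unique = {"A": 30, "B": 10, "C": 5}
multi = [("A", "B")] * 4
print(allocate_multireads(unique, multi))
```

Ignoring the four multireads would leave A at 30 and B at 10; the allocation keeps every read in the total while respecting the 3:1 ratio seen in the unique reads. EM-based tools can be viewed as repeating this kind of allocation with updated abundance estimates until the estimates stabilize.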

64 Multireads: Reads Mapping to Multiple Genes/Transcripts [Figure: Long and Medium genes] N

65 Conclusions for quantitation Isoform-level estimates will become easier as read length increases. EM approaches are currently the best option. Bad alignment = bad quantitation Pseudogenes are problematic. Check your alignments -> adjust method. Iterative process. N

66 NORMALIZATION A speed bump on the road from raw counts to differential expression. S

67 Large pool, small sample problem. A typical RNA library is estimated to contain 2.4 x 10^12 molecules (McIntyre et al. 2011). A typical sequencing run yields 25 million reads/sample. This means that only about 0.001% (1/1000th of a percent) of RNA molecules are sampled in a given run. High-abundance transcripts are sampled more frequently. Example: Albumin = 13% of all reads in liver RNA-seq samples. Sampling errors affect low-abundance transcripts most. S
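The sampling fraction on this slide is a one-line calculation. The sketch below assumes a pool of 2.4 x 10^12 molecules; the exponent is an assumption chosen because it agrees with the slide's "1/1000th of a percent" at 25 million reads. The albumin figure uses the 13% quoted on the slide.

```python
POOL_MOLECULES = 2.4e12   # assumed library size (exponent inferred, see lead-in)
READS_PER_SAMPLE = 25e6   # typical run, from the slide
ALBUMIN_SHARE = 0.13      # albumin's share of liver reads, from the slide

sampled_fraction = READS_PER_SAMPLE / POOL_MOLECULES
sampled_percent = sampled_fraction * 100
albumin_reads = ALBUMIN_SHARE * READS_PER_SAMPLE

print(f"sampled: {sampled_percent:.5f}% of molecules")
print(f"albumin: {albumin_reads / 1e6:.2f} million of the reads")
```

With roughly one molecule in 100,000 sampled, a transcript present at only a few copies per 100,000 molecules will often yield zero or one read, which is why sampling error hits low-abundance transcripts hardest.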

68 A finite pool of reads. S

69 Perfect world: All transcripts counted. [Figure: Alb and Low1 in Sample 1 and Sample 2] S

70 Real world: More reads taken up by highly expressed genes means fewer reads available for lowly expressed genes. [Figure: Alb and Low1 in Sample 1] S

71 Highly expressed genes that are differentially expressed can cause lowly expressed genes that are not actually differentially expressed to appear that way. [Figure: Alb and Low1 in Sample 1 and Sample 2] S

72 Normalization of raw counts. Wrong way to normalize data – Normalizing to the total number of mapped reads (e.g. FPKM). Top 10 highly expressed genes soak up 20% of reads in the liver. FPKM is widely used, and problematic. Better ways to normalize data – Normalize to the upper quartile (75th percentile) of non-zero counts, the median of scaled counts (DESeq), or the weighted trimmed mean of the log expression ratios (edgeR). S
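The contrast between these normalization schemes shows up even on a toy count table. The sketch below computes per-sample scaling factors three ways: total count, upper quartile of non-zero counts, and a simplified median-of-ratios in the spirit of DESeq. It is an illustration under stated assumptions, not the DESeq or edgeR implementation, and the counts are invented (one albumin-like gene that changes fivefold, four genes that do not change).

```python
import math
from statistics import median, quantiles


def total_count_factors(samples):
    """Scale each sample by its total count relative to the mean total."""
    totals = [sum(s) for s in samples]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


def upper_quartile_factors(samples):
    """Scale each sample by the 75th percentile of its non-zero counts."""
    uq = [quantiles([c for c in s if c > 0], n=4, method="inclusive")[2]
          for s in samples]
    mean_uq = sum(uq) / len(uq)
    return [u / mean_uq for u in uq]


def median_of_ratios_factors(samples):
    """Median over genes of count / per-gene geometric mean across samples."""
    ratios = [[] for _ in samples]
    for g in range(len(samples[0])):
        values = [s[g] for s in samples]
        if min(values) <= 0:
            continue  # genes with a zero count have no usable geometric mean
        geo_mean = math.exp(sum(math.log(v) for v in values) / len(values))
        for i, v in enumerate(values):
            ratios[i].append(v / geo_mean)
    return [median(r) for r in ratios]


# Invented counts per sample; the first gene plays the role of albumin.
sample1 = [1000, 100, 200, 300, 400]
sample2 = [5000, 100, 200, 300, 400]
samples = [sample1, sample2]

print("total count:     ", total_count_factors(samples))
print("upper quartile:  ", upper_quartile_factors(samples))
print("median of ratios:", median_of_ratios_factors(samples))
```

Total-count scaling treats sample 2 as three times deeper, so the four unchanged genes would look threefold lower after normalization. The other two factors ignore the single dominant gene and leave the unchanged genes unchanged.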

73 Normalizing for gene length: Longer transcripts will produce relatively more reads when fragmented. Only necessary if you are comparing the relative expression levels of two genes in one sample. TPM = Transcripts per million. [Figure: Long Gene and Short Gene in Sample 1] S
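Length normalization can be made concrete with a TPM calculation: divide each gene's read count by its length, then rescale so the values sum to one million. The gene lengths and counts below are invented to show three genes with equal abundance but different lengths.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million from read counts and transcript lengths (kb)."""
    rates = [count / length for count, length in zip(counts, lengths_kb)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]


# Invented example: long, medium, and short genes at equal abundance.
counts = [400, 200, 100]
lengths_kb = [4.0, 2.0, 1.0]
values = tpm(counts, lengths_kb)
print([round(v) for v in values])
```

The long gene collects four times as many reads as the short one purely because it is four times longer; after dividing by length, all three come out equal.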

74 Summary: As sequences get longer, alignment and isoform quantitation become easier! [Figure: RNA and short sequence reads]

75 Summary: Experimental Design; QC and Analysis Pipeline. [Figure: RNA and short sequence reads]

76 Resources. Aligner: Bowtie 2, GSNAP, TopHat. Transcript Abundance: Cufflinks, IsoEM, RSEM, MISO, Cuffdiff, DESeq, edgeR. De novo Assembly: Trinity.

77 Acknowledgements Narayanan Raghupathy, KB Choi Gary Churchill Ron Korstanje/ Karen Svenson/ Elissa Chesler Joel Graber Anuj Srivastava Churchill Lab – Dan Gatti Al Simons and Matt Hibbs Lisa Somes, Steve Ciciotte, mouse room staff Mice

78 Thank you!


82 Differential Expression. Over-estimation of variance: too conservative. Under-estimation of variance: too sensitive (many false positives). Expression abundance tends to be overdispersed in RNA-seq measurements. A simple Poisson model would result in under-estimation of variance and many false positive calls.

83 Differential Expression. [Equation: the variance as a Poisson term plus a term that addresses over-dispersion] The majority of DE detection methods try to estimate this term in their own way (e.g. DESeq, edgeR, EBSeq, DSS, NBPSeq, baySeq).
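The equation on this slide did not survive extraction. A standard way to write the idea, used here as an assumption rather than a reconstruction of the slide, is the negative binomial variance, variance = mean + dispersion x mean^2, where the first term is the Poisson part. The sketch below estimates the dispersion from replicate counts by the method of moments; the counts are invented.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion, assuming variance = mean + phi * mean**2."""
    m = mean(counts)
    v = variance(counts)  # sample variance across replicates
    if m == 0:
        return 0.0
    # A negative estimate means no evidence of over-dispersion; clamp to zero.
    return max(0.0, (v - m) / (m * m))


replicate_counts = [80, 100, 120, 150, 50]  # invented counts for one gene
m = mean(replicate_counts)
v = variance(replicate_counts)
phi = moment_dispersion(replicate_counts)
print(f"mean={m}, variance={v}, Poisson would expect variance={m}, phi={phi:.3f}")
```

Here the observed variance is far above the mean, so a Poisson test would understate the noise and call differences too readily. Estimates from a handful of replicates are unstable, which is why the listed tools share information across genes when estimating this term.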

84 Multiple Testing Correction and False Discovery Rate. XKCD: “Significant”. 2012 Ig Nobel Prize in Neuroscience for “finding brain activity signal in dead salmon using fMRI”. N
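The slide names false discovery rate control without specifying a procedure. The sketch below uses the Benjamini-Hochberg adjustment, which is an assumption on my part and a common choice for per-gene p-values. The p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]  # invented per-gene p-values
adjusted = benjamini_hochberg(pvals)
for p, q in zip(pvals, adjusted):
    print(f"p={p:.2f} -> adjusted={q:.4f}")
```

With thousands of genes tested at once, thresholding raw p-values at 0.05 would produce many false positives by chance alone; thresholding the adjusted values instead controls the expected share of false discoveries among the genes called significant.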

85 Northern Blot

