Presentation is loading. Please wait.

Presentation is loading. Please wait.

RNA-seq: From experimental design to gene expression.

Similar presentations


Presentation on theme: "RNA-seq: From experimental design to gene expression."— Presentation transcript:

1 RNA-seq: From experimental design to gene expression.
Steve Munger @stevemunger The Jackson Laboratory UMaine Computational Methods in Biology/ Genomics September 29, 2014 Thank Andre

2 Outline General overview of RNA-seq analysis. Introduction to RNA-seq
The importance of a good experimental design Quality control Read alignment Quantifying isoform and gene expression Normalization of expression estimates

3 Understanding gene expression
Alwine et. al. PNAS 1977 DeRisi et. al. Science 1997 ON/OFF 1.5 1.5 10.5

4 Next Generation Genome Sequencers
Illumina HiSeq and MiSeq PacBio SMRT 454 GS FLX Oxford Nanopore N

5 RNA-Seq: Sequencing Transcriptomes
ATGCTCA AGCTA TAGATGCTCA AGCTA ATGCTCA AGCTAATC ATGCTCA AGCTA AGTAGATGCTCA AGCTA ATGCTCA AGCTA ATGCTCA AGCTA ATGCTCA AGCTA TAGATGCTCA AGCTAATC CTCA AGCTAATCCTAG N

6 Applications of RNA-Seq Technology
Gene expression analysis Novel Exon discovery A/G A G ATGCTCA AGCTA TAGATGCTCA AGCTAATC AGTAGATGCTCA AGCTAATCCTAG CTCA A G ATGCTCA AGCTA TAGATGCTCA AGCTAATC AGTAGATGCTCA AGCTAATCCTAG CTCA TAGATGCTCAA N Allele Specific Gene Expression RNA Editing

7 RNA-Seq Total RNA mRNA mRNA after fragmentation cDNA Adaptors ligated
to cDNA Single/ Paired End Sequencing N

8 Challenges in RNA-Seq Work Flow
Study Design RNA isolation/ Library Prep Sequencing Reads (SE or PE) Aligned Reads Quantified isoform and gene expression N

9 Know your application – Design your experiment accordingly
Differential expression of highly expressed and well annotated genes? Smaller sample depth; more biological replicates No need for paired end reads; shorter reads (50bp) may be sufficient. Better to have 20 million 50bp reads than 10 million 100bp reads. Looking for novel genes/splicing/isoforms? More read depth, paired-end reads from longer fragments. Allele specific expression “Good” genomes for both strains. N

10 Good Experimental Design
One Illumina HiSeq Flowcell = 8 lanes Multiplex up to 24 samples in a lane Multiplexing Replication Randomization 187 Million SE reads per lane 374 Million PE reads per lane Cost per lane: 100 bases PE: $ $2700 50 bases PE: $ $2100 50 bases SE: $800 - $1200 N

11 RNA-Seq Experimental Design: Randomization
Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Bad Design N

12 RNA-Seq Experimental Design: Randomization
Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Bad Design Good Design N

13 RNA-Seq Experimental Design: Randomization
Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Bad Design Good Design Better Design N

14 Challenges in RNA-Seq Work Flow
Study Design RNA isolation/ Library Prep Sequencing Reads (SE or PE) Aligned Reads Quantified isoform and gene expression N

15 Total RNA poly-A tail selection mRNA mRNA after fragmentation Not actually random cDNA “Not So” random primers Size selection step (gel extraction) Adaptors ligated to cDNA PCR amplification Adapter dimers S

16 Sequence Read: Sanger fastq format
@HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1 TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG + @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1 The member of a pair Instrument: run/flowcell id Flowcell lane and tile number # = 35 – 33 = 2 D = = 35 J = 74 – 33 = 41 = Highest Quality Score 1 = 49 – 33 = 16 F = 70 – 33 = 37 “=“ = 61 – 33 = 28 H = 72 – 33 = 39 Index Sequence X-Y Coordinate in flowcell Q = -10 log10 P 10 indicates 1 in 10 chance of error 20 indicates 1 in 100, 30 indicates 1 in 1000, Phred Score: SN

17 To Trim or not to Trim? Quality control S

18 Quality Control: Sequence quality per base position
Good data Consistent High Quality Along the reads Bad data High Variance Quality Decrease with Length S

19 Per sequence quality distribution
bad data Y= number of reads X= Mean sequence quality Average data NGS Data Preprocessing S

20 Per sequence quality distribution
bad data Good data Y= number of reads X= Mean sequence quality Average data NGS Data Preprocessing S

21 Quality Control: Sequence Content Across Bases

22 K-mer content counts the enrichment of every 5-mer within the sequence library Bad: If k-mer enrichment >= 10 fold at any individual base position NGS Data Preprocessing

23 K-mer content Most samples

24 NGS Data Preprocessing
Duplicated sequences Good: non-unique sequences make up less than 20% Bad: non-unique sequences make >50% NGS Data Preprocessing S

25 PCR duplicates or high expressed genes? The case of Albumin.
~80,000x coverage here S

26 Tradeoffs to preprocessing data
Signal/noise -> Preprocessing can remove low-quality “noise”, but the cost is information loss. Some uniformly low-quality reads map uniquely to the genome. Trimming reads to remove lower quality ends can adversely affect alignment, especially if aligning to the genome and the read spans a splice site. Duplicated reads or just highly expressed genes? Most aligners can take quality scores into consideration. Currently, we do not recommend preprocessing reads aside from removing uniformly low quality samples. S

27 The problem with trimming all SE reads
100bp reads All reads trimmed to 75 bp Longer is better for splice junction spanning reads S

28 Quality Control: Resources
FASTX-Toolkit FastQC NGS Data Preprocessing S

29 RNA-Seq Work Flow Study Design Total RNA Sequencing Reads (SE or PE)
Aligned Reads Quantified isoform and gene expression KB

30 Alignment 101 100bp Read ACATGCTGCGGA Chr 3 ACATGCTGCGGA Chr 2 Chr 1
KB

31 The perfect read: 1 read = 1 unique alignment.
100bp Read ACATGCTGCGGA Chr 3 ACATGCTGCGGA Chr 2 Chr 1 KB

32 Some reads will align equally well to multiple locations. “Multireads”
100bp Read ACATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGA 1 read 3 valid alignments Only 1 alignment is correct ACATGCTGCGGA KB

33 The worst case scenario.
100bp Read ACATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGC 1 read 2 valid alignments Neither is correct ACATGCTGCGGA KB

34 Individual genetic variation may affect read alignment.
100bp Read ATATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGC ACATGCTGCGGA KB

35 Aligning Billions of Short Sequence Reads
Gene A Gene B Aligners: Bowtie, GSNAP, BWA, MAQ, BLAT Designed to align the short reads fast, but not accurate KB

36 Aligning Sequence Reads
Gene A Gene B Gene family (orthologous/paralogous) Low-complexity sequence Alternatively spliced isoforms Pseudogenes Polymorphisms Indels Structural Variants Reference sequence Errors KB

37 Align to Genome or Transcriptome?
KB

38 Aligning to Reference Genome: Exon First Alignment
KB

39 Aligning to Reference Genome: Exon First Alignment
KB

40 Aligning to Reference Genome: Exon First Alignment
KB

41 Exon First Alignment: Pseudo-gene problem
Processed Pseudo-gene Exon 1 Exon 2 Exon 3 KB

42 Exon First Alignment: Pseudo-gene problem
Processed Pseudo-gene Exon 1 Exon 2 Exon 3 KB

43 Exon First Alignment: Pseudo-gene problem
Processed Pseudo-gene Exon 1 Exon 2 Exon 3 KB

44 Align to Genome or Transcriptome?
Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes Transcriptome KB

45 Reference Transcriptome
Gene 1 Exon 1 Exon 2 Exon 3 Isoform 1 Isoform 2 Isoform 3 Gene 2 Exon 1 Exon 2 Exon 3 Exon 4 Isoform 1 Isoform 2 Isoform 3 KB

46 Align to Genome or Transcriptome?
Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes Transcriptome Advantages: Easy, Focused to the part of the genome that is known to be transcribed. Disadvantages: Reads that come from novel isoforms may not align at all or may be misattributed to a known isoform. KB

47 Better Approach: Aligning to Transcriptome and Genome
Align to Transcriptome First Align the remaining reads to Genome next Advantages: relatively simpler, overcomes the pseudo-gene and novel isoform problems RUM, TopHat2, STAR KB Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes

48 Conclusions There is no perfect aligner. Pick one well-suited to your application. E.g. Want to identify novel exons? Don’t align only to the known set of isoforms. Visually inspect the resulting alignments. Setting a parameter a little too liberal or conservative can have a huge effect on alignment. Consider running the same fastq files through multiple alignment pipelines specific to each application. Gene expression -> Bowtie to transcriptome Exon discovery -> RUM or other hybrid mapper Variant detection -> GSNAP, GATK, Samtools mpileup If your species has not been sequenced, use a de novo assembly method. Can also use the genome of a related species as a scaffold. KB

49 Output of most aligners: Bam/Sam file of reads and genome positions

50 Visualization of alignment data (BAM/SAM)
Genome browsers – UCSC, IGV, etc. Have students load IGV. Upload a BAM file to Sharepoint.

51 IGV is your friend. SNP Coverage density plot Read color = strand

52 Aligned Reads to Gene Abundance
Total RNA 100bp Reads Aligned Reads Quantified isoform and gene expression N

53 Aligned Reads to Gene Abundance: Challenges
Long Short Many approaches to quantify expression abundance N

54 Aligned Reads to Gene Abundance: Challenges
200 1000 reads 1 Long 100 2 Medium 50 3 Short Relative abundance for these genes, f1, f2, f3 N

55 Aligned Reads to Gene Abundance: Challenges
200 400 1 Long 100 400 2 Medium 50 200 3 Short Relative abundance for these genes, f1, f2, f3 N

56 Aligned Reads to Gene Abundance: Challenges
200 400 1 Long 100 400 2 Medium 50 200 3 Short Relative abundance; f1=0.2, f2=0.4, f3=0.4 N

57 Multireads: Reads Mapping to Multiple Genes/Transcripts
Long Medium N

58 Gene multireads Reads that map to multiple genomic locations Gene A
Gene B Gene C Reads that map to multiple genomic locations N

59 Isoform multireads Exon 1 Exon 2 Exon 3 Isoform 1 Isoform 2 Isoform 3

60 Multireads: Reads Mapping to Multiple Genes/Transcripts
200 350 1 Long 150 100 300 2 Medium Multireads 50 200 3 Short Unique Relative abundance for these genes, f1, f2, f3 N

61 Approach 1: Ignore Multireads
200 350 1 Long 150 100 300 2 Medium 50 200 3 Short Relative abundance for these genes, f1, f2, f3 Nagalakshmi et. al. Science. 2008 Marioni, et. al. Genome Research 2008 N

62 Approach 1: Ignore Multireads
200 350 1 Long 150 100 300 2 Medium 50 200 3 Short Over-estimates the abundance of genes with unique reads Under-estimates the abundance of genes with multireads Not an option at all, if interested in isoform expression N

63 Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques
200 350 1 Long 150 100 300 2 Medium 50 200 3 Short Relative abundance for these genes, f1, f2, f3 Ali Mortazavi, et. al. Nature Methods 2008 RSEM,Cufflinks N

64 Multireads: Reads Mapping to Multiple Genes/Transcripts
Long Medium N

65 Conclusions for quantitation
Isoform-level estimates will become easier as read length increases. EM approaches are currently the best option. Bad alignment = bad quantitation Pseudogenes are problematic. Check your alignments -> adjust method. Iterative process. N

66 A speed bump on the road from raw counts to differential expression.
Normalization S

67 Large pool, small sample problem
Typical RNA library estimated to contain 2.4 x 1012 molecules. McIntyre et al 2011 Typical sequencing run = 25 million reads/sample. This means that only (1/1000th of a percent) of RNA molecules are sampled in a given run. High abundance transcripts are sampled more frequently. Example: Albumin = 13% of all reads in liver RNA-seq samples. Sampling errors affect low-abundance transcripts most. S

68 A finite pool of reads. S

69 Perfect world: All transcripts counted.
Sample 1 Sample 2 Perfect world: All transcripts counted. Alb Alb Albumin differentially expressed, but Low1 is not differentially expressed. Low1 Low1 S

70 Sample 1 Real world: More reads taken up by highly expressed genes means less reads available for lowly expressed genes. Alb Low1 S

71 Sample 1 Sample 2 Alb Alb Higher expressed Albumin in sample 2 sucks up the reads that would have gone to Low1, making Low1 appear to be differentially expressed between the samples. Highly expressed genes that are differentially expressed can cause lowly expressed genes that are not actually differentially expressed to appear that way. Low1 Low1 S

72 Normalization of raw counts
Wrong way to normalize data Normalizing to the total number of mapped reads (e.g. FPKM). Top 10 highly expressed genes soak up 20% of reads in the liver. FPKM is widely used, and problematic. Better ways to measure data Normalize to upper quartile (75th %) of non-zero counts, median of scaled counts (DESeq), or the weighted trimmed mean of the log expression ratios (EdgeR). S

73 Normalizing for gene length: Longer transcripts will produce relatively more reads when fragmented.
Sample 1 Long Gene Short Gene Only necessary if you are comparing the relative expression levels of two genes in one sample. TPM = Transcripts per million. S

74 Summary RNA ATGCTCA AGCTA TAGATGCTCA AGCTA ATGCTCA AGCTAATC ATGCTCA AGCTA AGTAGATGCTCA AGCTA ATGCTCA AGCTA ATGCTCA AGCTA ATGCTCA AGCTA TAGATGCTCA AGCTAATC CTCA AGCTAATCCTAG As sequences get longer, alignment and isoform quantitation becomes easier!

75 Summary RNA QC and Analysis Pipeline Experimental Design ATGCTCA AGCTA
TAGATGCTCA AGCTA ATGCTCA AGCTAATC ATGCTCA AGCTA AGTAGATGCTCA AGCTA QC and Analysis Pipeline Experimental Design ATGCTCA AGCTA ATGCTCA AGCTA ATGCTCA AGCTA TAGATGCTCA AGCTAATC CTCA AGCTAATCCTAG

76 Resources Transcript Abundance Aligner
Bowtie 2 GSNAP Tophat Transcript Abundance Cufflinks IsoEM RSEM Miso Cuffdiff DESeq edgeR Denovo Assembly Trinity

77 Acknowledgements Narayanan Raghupathy, KB Choi Gary Churchill
Ron Korstanje/ Karen Svenson/ Elissa Chesler Joel Graber Anuj Srivastava Churchill Lab – Dan Gatti Al Simons and Matt Hibbs Lisa Somes, Steve Ciciotte, mouse room staff Mice

78 Thank you!

79

80

81

82 Differential Expression
Over-estimation of Too conservative Under-estimation of Too sensitive (Many false positives) t_g = \dfrac{\hat{\mu}_{g,1}-\hat{\mu}_{g,2}} {\sqrt{\dfrac{\hat{\sigma}^2_{g,1}}{N_1}+ \dfrac{\hat{\sigma}^2_{g,2}}{N_2} } Expression abundance tends to overdispersed in RNA-seq measurements. Simple Poisson model would result in under-estimation of variance and many false positive calls.

83 Differential Expression
A term that addresses over-dispersion Poisson The majority of DE detection methods are trying to estimate this guy in their own way. (ex) DESeq, edgeR, EBSeq, DSS, NBPSeq, baySeq, etc.

84 Multiple Testing Correction and False Discovery rate
XKCD Significant 2012 IgNobel prize in Neuroscience for “finding Brain activity signal in dead salmon using fMRI” N

85 Northern Blot


Download ppt "RNA-seq: From experimental design to gene expression."

Similar presentations


Ads by Google