RNA-seq: Quantifying the Transcriptome


1 RNA-seq: Quantifying the Transcriptome
Alisha Holloway, PhD Gladstone Bioinformatics Core Director 12

2 What is RNA-seq? Use of high-throughput sequencing technologies to assess the RNA content of a sample.

3 Why do an RNA-seq experiment?
Detect differential expression
Assess allele-specific expression
Quantify alternative transcript usage
Discover novel genes/transcripts, gene fusions
Profile transcriptome
Ribosome profiling to measure translation
Differential expression is measured between two or more groups: tumor/control, wildtype/ko, drug/control.

4 Why do an RNA-seq experiment?
Assess allele-specific expression.
Approximately 1/3 of SNVs are not in dbSNP. More data on human variation arrive every day: two papers in Science last week on rare variants.
Figure: Skelly et al. 2011
References:
1. Li, G. et al. Identification of allele-specific alternative mRNA processing via transcriptome sequencing. Nucleic Acids Res. (2012). doi: /nar/gks280
2. Rozowsky, J. et al. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol. Syst. Biol. 7, 522 (2011).
3. Skelly, D. A., Johansson, M., Madeoy, J., Wakefield, J. & Akey, J. M. A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data. Genome Res. 21, 1728–1737 (2011).

7 Why do an RNA-seq experiment?
During a developmental time course: how does gene expression change during differentiation of stem cells into cardiomyocytes? Or compare tissues.
[Figure: Pluripotent Stem Cell, Cardiogenic Mesoderm, Cardiac Precursors, Cardiomyocytes]

8 Why do an RNA-seq experiment?
Ribosome profiling to measure translation. More tomorrow!
Most RNA-seq studies measure the amount of transcript, but not all transcripts are translated into protein. Translation can be monitored by sequencing the ribosome-protected mRNA fragments, which are about 30 nt long. With deep enough sequencing, one can estimate the amount of transcript being translated. Compare this to the estimate of transcript abundance to determine the relationship between transcription and translation. Ribosome profiling can also identify regions of ribosomal pausing, a mechanism of translation control that may allow proper protein secondary structures to form.
Ingolia et al. 2009, Weissman Lab

9 RNA-seq vs. Microarray
RNA-seq: ID novel genes, transcripts, & exons; greater dynamic range; less bias due to genetic variation; repeatable; no species-specific primer/probe design; more accurate relative to qPCR; many more applications.
Microarray: well vetted QC and analysis methods; well characterized biases; quick turnaround from established core facilities; currently less expensive.
Greater dynamic range: fluorescence vs. direct count of reads.
Genetic variation: 1 mismatch can lead to up to 30% difference in signal intensity.
Go to image for comparison of accuracy.
New applications: assess allele-specific expression; discover novel genes/transcripts, gene fusions; ribosome profiling to measure translation.

10 RNA-seq vs. Affy; RNA-seq vs. Taqman
[Figure: two comparison panels, RNA-seq vs. Affy and RNA-seq vs. Taqman. Marioni et al. 2008; © 2010 NuGen]
Slope > 1 indicates truncation at extremes.

11 Illumina vs. Pac-Bio
Read length: Illumina 100 bp paired end; Pac-Bio 2500 bp avg
Throughput: Illumina 200 million read pairs/lane; Pac-Bio 1 million reads/SMRT cell
Error rate: Illumina <1%; Pac-Bio 15% total, most are indels, 4% SNP
Cost: Illumina $600/sample; Pac-Bio $7-8k/sample
Accessibility: Illumina UCSF, UC-Davis, BGI; Pac-Bio no commercially available protocols
Uses: Illumina DE, ASE, quantifying alternative transcript usage; Pac-Bio characterizing the transcriptome
Which technology to use?
DE & ASE: need depth.
Alternative transcripts: if you want to quantify usage within or between samples, use deep sequencing. If you just want a qualitative assessment, Pac-Bio is better.
Characterizing the transcriptome: if you don't have a good annotation or a reference genome sequence.
Both technologies allow you to discover novel genes/transcripts and gene fusions.

12 When to use Pac-Bio
Short-read sequencing using Illumina or SOLiD technology is by far the most common, but there are cases when one might want to measure the entire transcript.
[Figure] Three exons can be joined by: 1. one end of a pair mapping to exon 1; 2. the other end spanning exons 2 & 3.
[Figure] Can't join because exon 3 is longer than the insert size.

13 Plan it well.
Experimental design: biological replicates; reference genome?; good gene annotation?
Read depth
Barcoding
Read length
Paired vs. single-end
Because short-read RNA-seq is more commonly used and more readily available, we'll focus on that. When RNA-seq was first being done, people said that you didn't need replicates because you'd be getting a digital readout of the RNA in a sample. People have since examined the technical variation from extracting RNA, making cDNA, library preps, and sequencing lane effects. Those effects are minimal compared to biological variation. We recommend n=3 biological replicates, or more if you're sequencing from human tissues.
[Figure: biological variation vs. technical variation]

14 Plan it well.
Experimental design: reference genome? Good gene annotation?
Do you want to estimate transcript-level abundance or gene-level abundance? Mammals (and other vertebrates) have a lot of tissue-specific or developmental-stage-specific transcript usage. To estimate transcript-level abundance you need a good annotation, a reference genome, and very deep sequencing, both to ensure that you cover splice junctions and to have the power to distinguish between very similar transcripts. Several tools use sophisticated algorithms to assign reads to specific transcripts. We'll talk more about this later, but the point is that if you want to estimate transcript-level abundance for a vertebrate sample, you need very deep sequencing.

15 Plan it well.
Read depth: gene abundance; transcript abundance. 50 million read pairs for eukaryotes.
Barcoding: can barcode and run 4 per lane.

16 How much data do we need?
~15-20K genes are expressed in a tissue or cell line. Genes are on average 3 kb long. For 1x coverage using 100 bp reads, we would need 600K sequence reads (20K genes x 3 kb / 100 bp). In reality, we need MUCH higher coverage to accurately estimate gene expression levels: 50 million reads. For transcript abundance of lowly expressed genes, we may need even more. Jury still out.
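The arithmetic on this slide can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are made up for this example, and it assumes every gene is expressed at the same level, which real transcriptomes never are (that is exactly why much higher depth is needed in practice).

```python
def reads_for_coverage(n_genes, mean_gene_length_bp, read_length_bp, coverage=1.0):
    """Reads needed to cover the expressed transcriptome at a given average depth.

    Assumes uniform expression across genes (an idealization for illustration).
    """
    total_bases = n_genes * mean_gene_length_bp
    return coverage * total_bases / read_length_bp


def average_coverage(n_reads, read_length_bp, n_genes, mean_gene_length_bp):
    """Average fold coverage obtained from a given number of reads."""
    return n_reads * read_length_bp / (n_genes * mean_gene_length_bp)


# 20K genes x 3 kb each, 100 bp reads, 1x coverage
print(reads_for_coverage(20_000, 3_000, 100))

# Average coverage obtained from 50 million 100 bp reads
print(round(average_coverage(50_000_000, 100, 20_000, 3_000), 1))
```

Under these assumptions the first call reproduces the 600K figure, and 50 million reads correspond to roughly 83x average coverage; lowly expressed genes will sit far below that average.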

17 Plan it well.
Read depth: gene abundance; transcript abundance. 50 million read pairs for eukaryotes.
Barcoding: can barcode and run 4 per lane. 200 million reads / lane, so run 4 samples / lane.
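A small planning helper makes the barcoding arithmetic explicit. The sketch below assumes the figures from the slide (200 million reads per lane, 50 million per sample); the function names are hypothetical and the calculation ignores uneven barcode representation.

```python
import math


def samples_per_lane(reads_per_lane, reads_per_sample):
    """How many barcoded samples fit in one lane at the target depth."""
    return reads_per_lane // reads_per_sample


def lanes_needed(n_samples, reads_per_lane, reads_per_sample):
    """Number of lanes required to sequence all samples at the target depth."""
    per_lane = samples_per_lane(reads_per_lane, reads_per_sample)
    if per_lane == 0:
        raise ValueError("target depth exceeds the capacity of a single lane")
    return math.ceil(n_samples / per_lane)


# Slide figures: 200 million reads per lane, 50 million reads per sample
print(samples_per_lane(200_000_000, 50_000_000))

# Two groups with n=3 biological replicates each
print(lanes_needed(6, 200_000_000, 50_000_000))
```

With these numbers, a two-group experiment with three replicates per group needs two lanes.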

18 Plan it well.
Read length: number of unique sequences = 4^(read length)
Read length 25: 1.1 x 10^15 unique sequences
Read length 50: 1.3 x 10^30 unique sequences
Read length 100: 1.6 x 10^60 unique sequences
~60 million coding bases in a vertebrate genome.
What proportion of reads map when they're 25 vs. 50 bases long? This can be estimated, but it is better to test on real data with real biases (GC bias, paralogs, protein domains). We're in the process of getting this for real data.
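The table values follow directly from four possible bases at each position. A minimal check, with a hypothetical helper name:

```python
def unique_sequences(read_length):
    """Number of distinct nucleotide sequences of a given length (A, C, G, T)."""
    return 4 ** read_length


for length in (25, 50, 100):
    # Scientific notation with one decimal, matching the slide's table
    print(length, f"{unique_sequences(length):.1e}")
```

Even at 25 bases the sequence space dwarfs the ~60 million coding bases mentioned on the slide, which is why the practical limits on mappability come from repeats, paralogs, and shared protein domains rather than from read length alone.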

19 Plan it well.
Paired vs. single-end: paired-end!
Effectively doubles read length: huge impact on read mapping
Increases number of splice-junction-spanning reads
Critical for estimating transcript-level abundance
50 bp single end: 6.25 million unique, square that.
Single end: only splice-junction-spanning reads can connect exons. Paired end: reads in two exons are informative as well.
Have to be able to connect exons! Depth is important, but knowing that three exons are expressed together is critical.

20 The wet lab side…briefly
OK, you've planned a great experiment, now what? Extract RNA, then reverse transcribe (RT) to get cDNA. There are several QC steps along the way to make sure you've got good samples (RNA quality, quantity), but I won't talk more about the wet lab side today.

21 How do you make sense of this pile of data?
[Workflow figure: QC; Alignment; Expt: Compare two groups; Transcript Assignment & Abundance; Differential Expression; Expt: Allele-specific expression]
You will possibly have on the order of several hundred GB of data.

22 QC
FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Proportion of reads that mapped uniquely
Remove duplicates; likely due to PCR amplification
Assess ribosomal RNA content
Assess content of possible contaminants: human RNA (if not human samples), Mycoplasma (if cell lines)
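To make the duplicate check concrete, here is a toy sketch that counts exact-sequence duplicates in FASTQ text. It is not how FastQC computes its duplication metric, and the function names and example reads are invented for illustration. Keep in mind that in RNA-seq some duplicates are genuine, because highly expressed genes produce many identical fragments.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def duplicate_fraction(sequences):
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total


# Four made-up reads; r1 and r2 are identical
records = """@r1
ACGTACGT
+
IIIIIIII
@r2
ACGTACGT
+
IIIIIIII
@r3
TTTTCCCC
+
IIIIIIII
@r4
GGGGAAAA
+
IIIIIIII
""".splitlines()

print(duplicate_fraction(fastq_sequences(records)))
```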

23 Then what?
Align reads to the genome
Easy(ish) for genomic sequence
Difficult for transcripts with splice junctions

24 Alignment Algorithms
Burrows-Wheeler Transform: Bowtie (Langmead et al. 2009), BWA (Li and Durbin 2009), SOAP2 (Li et al. 2009)
Smith-Waterman: BFAST (Homer et al. 2009, based on BLAT). Uses multiple indexes, finds candidate alignment locations using seed and extend, then runs a gapped Smith-Waterman local alignment for each candidate.
BWT is a reversible text transform, originally used for compression, that lets aligners index the genome with a small memory footprint. The Wikipedia article has a very good explanation.
Smith-Waterman: build a matrix of scores for all pairwise nucleotide comparisons between two sequences, then follow the path through the matrix that optimizes the similarity measure; guaranteed to find the optimal local alignment.
References:
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009;25:
Homer N, Merriman B, Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS ONE 2009;4:e7767.
Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:
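Both ideas on this slide fit in a few lines of toy Python. The sketch below is purely illustrative and is not how Bowtie, BWA, or BFAST are implemented: the BWT is built by sorting all rotations (real aligners use far more efficient index construction), and the Smith-Waterman scorer uses a simple linear gap penalty with scoring values chosen arbitrarily for this example.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (fine for tiny inputs only)."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(bwt("banana"))
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The transform groups identical characters into runs, which is what makes the text compressible; the local alignment score here comes from the shared ACGT core of the two sequences.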

25 Alignment tools for splice junction mapping
Tophat, MapSplice, SpliceMap, HMMsplicer
Many reads map to contiguous genome sequence, but others are split between two exons.

26 Tophat
Map reads to transcriptome using Bowtie
Map to genome to discover novel exons, or start here if no annotation is available
Split reads into smaller segments; map to genome to discover novel splice junctions
Report best alignment for each read
Tophat2 uses a different protocol and different algorithms than described in the 2009 paper.
Trapnell et al. Bioinformatics 2009; Trapnell et al. Nature Protocols 2012

27 MapSplice & SpliceMap
Tag alignment (user chooses aligner)
Break reads into segments
Map reads
Unmapped segments considered for splice junction mapping based on location of partner segment
Merge segments from read for final alignment
Assess splice junction quality
An overview of the MapSplice pipeline: the algorithm contains two phases, tag alignment (Step 1–Step 4) and splice inference (Step 5–Step 6). In the 'tag alignment' phase, candidate alignments of the mRNA tags to the reference genome are determined. In the 'splice inference' phase, splice junctions that appear in one or more tag alignments are analyzed to determine a splice significance score based on the quality and diversity of alignments that include the splice. Ambiguous candidate alignments are resolved by selecting the alignment with the overall highest quality match and highest confidence splice junctions.
Wang et al. doi: /nar/gkq622
Kin Fai Au, Hui Jiang, Lan Lin, Yi Xing, and Wing Hung Wong. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Research, advance access published on April 5, 2010.
Wang et al. NAR 2010, Au et al. NAR 2010

28 HMMsplicer
Remove reads that map contiguously
Hidden Markov model to detect exon boundary of remaining reads
Compute intensive
Reference annotation not used
Best for compact genomes
User sets threshold for accepting splice junction
Dimon et al. PLoS One 2010

29 HMMsplicer
HMMSplicer begins by dividing the remaining reads in half and aligning each half to the genome. All alignments for both read halves are considered autonomously and are not resolved until the final scoring step. Once a read-half is aligned, a Hidden Markov Model (HMM) is used to detect the most probable splice position. The HMM is trained on a subset of read-half alignments to best reflect the quality and base composition of the dataset and genome. Next, the remaining portion of the read is aligned downstream of the exon-intron boundary, completing the junction definition. Finally, identical junctions are collapsed into a single junction and all junctions are scored, filtered by score, and divided by splice-site edges, with canonical (GT-AG and GC-AG) junctions in one result set and non-canonical edges in a second result set.
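The last step, collapsing identical junctions and splitting them by splice-site edges, can be sketched as follows. This is not HMMSplicer's code: the function names, the 0-based half-open intron coordinates, the forward-strand assumption, and the toy genome are all choices made for this illustration, and scoring and filtering are left out.

```python
from collections import Counter

# Splice-site dinucleotide pairs the slide treats as canonical
CANONICAL_EDGES = {("GT", "AG"), ("GC", "AG")}


def classify_junction(genome, intron_start, intron_end):
    """Classify an intron by its first and last two bases.

    Coordinates are 0-based, half-open, on the forward strand (an assumption).
    """
    donor = genome[intron_start:intron_start + 2].upper()
    acceptor = genome[intron_end - 2:intron_end].upper()
    return "canonical" if (donor, acceptor) in CANONICAL_EDGES else "non-canonical"


def collapse_and_split(genome, junctions):
    """Collapse identical junctions, then split them into the two result sets."""
    support = Counter(junctions)  # junction -> number of supporting reads
    result = {"canonical": {}, "non-canonical": {}}
    for (start, end), n_reads in support.items():
        result[classify_junction(genome, start, end)][(start, end)] = n_reads
    return result


# Toy genome: exon, GT...AG intron (6-18), exon, AT...AC intron (24-36), exon
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGTACC" + "ATAAGTTTTCAC" + "TTGACA"
junctions = [(6, 18), (6, 18), (24, 36)]
print(collapse_and_split(genome, junctions))
```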

30 Transcript Assignment/Abundance
Problem is trying to assign reads to particular transcripts when they share several exons in a row. Martin & Wang, Nature Reviews Genetics 2011

31 Transcript Assignment and/or Abundance Tools
For DE: Cufflinks, MISO, Scripture (not maintained)
De novo assembly: Trans-ABySS, Trinity, Maker
Methods for alignment are fairly stable, but assigning reads to transcripts and estimating transcript abundance is really difficult and is a hot area of research right now. As read lengths increase this will become less of a problem, but until then we need good methods for assigning reads to transcripts and estimating abundance. I hesitate to include papers for these tools because the original papers describing them are already out of date with the latest versions of the software.

32 Cufflinks
Constructs the parsimonious set of transcripts that explains the observed reads. Basically, finds a minimum path cover on the DAG (directed acyclic graph) built from the aligned fragments.
Derives a likelihood for the abundances of a set of transcripts given a set of fragments.
FPKM: fragments per kb of exon per million fragments mapped.
Trapnell, Pachter
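The FPKM definition translates directly into a one-line formula. The sketch below shows only the unit conversion; it is not Cufflinks' estimator, which derives abundances from a likelihood model rather than from raw per-transcript counts. The function name and example numbers are made up.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    # 1e9 = 1e3 (bp per kb) * 1e6 (fragments per million)
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# 1,500 fragments on a 3 kb transcript in a library of 50 million mapped fragments
print(fpkm(1_500, 3_000, 50_000_000))
```

Dividing by length and by library size is what makes values comparable between long and short transcripts and between deeply and shallowly sequenced samples.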

33 MISO Mixture of Isoforms
Bayesian: treats the expression level of a set of isoforms as a random variable and estimates a distribution over the values of this variable. Gives confidence intervals for expression estimates and measures of DE as Bayes factors.
Burge, MIT

34 Bias Correction and Normalization
After assigning reads to transcripts and estimating abundance, apply bias correction and normalization.
Random hexamer bias (Hansen et al. 2010): from PCR or RT primers. Re-estimate FPKM or read counts based on the bias.
Upper quartile normalization (Bullard et al. 2010): excellent resource for comparison to qPCR and microarray as well as for methods of normalization of RNA-seq data.
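As a rough illustration of upper quartile normalization, the sketch below rescales each sample so that the 75th percentile of its counts matches the average across samples. Details here are assumptions made for the example rather than a reproduction of Bullard et al.: zero counts are excluded before taking the quartile, the quartile uses Python's "inclusive" interpolation, and the sample names and counts are invented.

```python
from statistics import mean, quantiles


def upper_quartile(counts):
    """75th percentile of the nonzero counts in one sample."""
    nonzero = [c for c in counts if c > 0]
    return quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(samples):
    """Rescale each sample so all samples share the same upper quartile."""
    uq = {name: upper_quartile(counts) for name, counts in samples.items()}
    target = mean(list(uq.values()))
    return {
        name: [c * target / uq[name] for c in counts]
        for name, counts in samples.items()
    }


# Sample B has the same expression profile as A but was sequenced twice as deeply
samples = {
    "A": [10, 20, 30, 40, 50],
    "B": [20, 40, 60, 80, 100],
}
print(upper_quartile_normalize(samples))
```

After normalization the two samples have identical values, which is the intended behavior when the only difference between them is sequencing depth.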

35 Differential Expression
Goal: determine whether the observed difference in read counts is greater than would be expected due to random variation.
If reads were independently sampled from a population, read counts would follow a multinomial distribution, which is approximated by the Poisson.
Binomial: discrete probability distribution of the number of successes in a sequence of n independent trials, each of which yields success with probability p. K is trials.
The Poisson is a good approximation of the binomial if n >= 20 and p <= 0.05 (the variance of the binomial is np(1-p)).
The Poisson has a single parameter, lambda, from which the mean, variance, and other properties of the distribution follow.
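The quality of the Poisson approximation is easy to check numerically. The sketch below compares the two probability mass functions for an arbitrary example (n = 1000 trials, p = 0.002); the values of n and p are chosen only for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


n, p = 1000, 0.002
lam = n * p  # Poisson mean matched to the binomial mean

for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))

# Binomial variance np(1-p) versus Poisson variance lambda
print(n * p * (1 - p), lam)
```

With small p the factor (1-p) is close to 1, so the binomial variance is nearly equal to its mean, which is the defining property of the Poisson.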

36 Differential Expression
BUT! We know that the count data show more variance than expected under the Poisson. This overdispersion problem is mitigated by using the negative binomial distribution, which is determined by mean and variance. The real trick is estimating the variance, as you will soon see.
Notation: sample j, gene i.
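To see what overdispersion looks like in numbers, the sketch below uses a common negative binomial parametrization, variance = mean + dispersion x mean^2, and a naive per-gene method-of-moments estimate. Both the parametrization and the estimator are assumptions for illustration; the tools on the next slide estimate variance by sharing information across genes, precisely because a per-gene estimate from three or so replicates is very noisy. The replicate counts are invented.

```python
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Negative binomial variance: the Poisson term plus an overdispersion term."""
    return mu + dispersion * mu**2


def moment_dispersion(counts):
    """Naive per-gene dispersion estimate from replicate counts."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    # Anything not exceeding the Poisson variance is treated as no overdispersion
    return max(0.0, (variance(counts) - mu) / mu**2)


replicate_counts = [80, 100, 120, 140, 60]  # one gene, five replicates

print(mean(replicate_counts), variance(replicate_counts))
print(moment_dispersion(replicate_counts))
```

Here the sample variance is ten times the mean, far more than a Poisson model (variance equal to mean) would allow.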

37 Differential Expression
Binomial test
Old Cuffdiff
Negative binomial
DESeq: estimates variance using all genes with similar expression levels
Cuffdiff: similar to DESeq, but incorporates fragment assignment uncertainty simultaneously
EdgeR: moderates variance over all genes
T-test

38 Differential Expression
Old cuffdiff

39 Some biology, finally?
How have gene expression patterns changed during the course of differentiation?
Which genes are specific to certain cell types?
What can we learn about what those co-expressed genes do?
After testing for DE and adjusting p-values for multiple tests (FDR), we can get to some biology.
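The slide mentions FDR adjustment without naming a procedure; a widely used one is Benjamini-Hochberg, sketched below as an assumption about what such an adjustment might look like. The function name and the example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print([round(q, 4) for q in benjamini_hochberg(raw)])
```

Genes whose adjusted value falls below the chosen FDR threshold (for example 0.05) would be carried forward as differentially expressed.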

40 Clusters of co-expressed genes
Use a statistical test to determine which genes are differentially expressed. Of those that are DE, group them by expression pattern.
Use unsupervised clustering to group genes by expression pattern.
Use gene ontology information to determine which kinds of genes are in each group.
Reveal novel associations and gene types.
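One common way to ask "which kinds of genes are in each group" is an over-representation test for each ontology term. The slide does not say which test was used, so the hypergeometric test below is an assumption, and the function name and numbers are invented for illustration.

```python
from math import comb


def enrichment_pvalue(population, annotated_total, cluster_size, hits_in_cluster):
    """P(at least this many annotated genes in the cluster) under random sampling.

    population: all genes considered; annotated_total: genes carrying the term.
    """
    upper = min(cluster_size, annotated_total)
    ways_total = comb(population, cluster_size)
    ways_tail = sum(
        comb(annotated_total, k) * comb(population - annotated_total, cluster_size - k)
        for k in range(hits_in_cluster, upper + 1)
    )
    return ways_tail / ways_total


# 20 DE genes, 5 carry the term, and a cluster of 4 genes contains 3 of them
print(round(enrichment_pvalue(20, 5, 4, 3), 4))
```

In a real analysis this would be repeated for every term and every cluster, so the resulting p-values need multiple-testing correction as well.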

41 Clusters of co-expressed genes
Pluripotency/stem cell: Nanog, Oct4
Mesoderm/cell fate commitment: Mesp1, Eomes
Cardiac precursors: Isl1, Mef2c, Wnt2
Cardiac structure/function: Actc1, Ryr2, Tnni3
The ontology terms associated with the genes make sense for the stage of expression. We can learn about new associations in co-expressed clusters: some of the genes in a cluster have not previously been associated with particular stages. We can begin to make new connections/pathways/networks to understand cardiac development.

42 Thanks for listening! Alisha Holloway Gladstone Institutes Bioinformatics Core

