Introduction to RNAseq


1 Introduction to RNAseq

2 NGS - Quick Recap
Many applications -> research intent determines technology platform choice
High volume data BUT error prone
FASTQ is the accepted standard format
Must assess quality scores before proceeding
‘Bad’ data can be rescued
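A minimal sketch of what "assess quality scores" can look like in practice. It assumes Phred+33 encoded qualities and strict 4-line FASTQ records; the function names and the mean-quality threshold of 20 are illustrative choices, not part of the slides.

```python
from typing import Iterator, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_quality(quality: str) -> float:
    """Mean Phred score of one read, decoded from its quality string."""
    if not quality:
        return 0.0
    return sum(ord(ch) - PHRED_OFFSET for ch in quality) / len(quality)


def passing_reads(handle: TextIO, min_mean_q: float = 20.0):
    """Keep only reads whose mean Phred score reaches the (assumed) threshold."""
    for read_id, sequence, quality in read_fastq(handle):
        if mean_quality(quality) >= min_mean_q:
            yield read_id, sequence, quality
```

Dedicated tools report far more than a per-read mean (per-position quality, adapter content, and so on); this only shows how the scores in a FASTQ record are decoded.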

3 The Central Dogma of Molecular Biology
Reverse Transcription

4 RNAseq Protocols
cDNA, not RNA sequencing
Types of libraries available:
Total RNA sequencing (not advised)
polyA+ RNA sequencing
Small RNA sequencing (specific size range targeted)

5 cDNA Synthesis

6 Genome-scale Applications
Transcriptome analysis
Identifying new transcribed regions
Expression profiling
Resequencing to find genetic polymorphisms: SNPs, micro-indels, CNVs
Question: Why even bother with exome sequencing then?

7 What about microarrays??!!!
Assumes we know all transcribed regions and that spliceforms are not important
Cannot find anything novel
BUT may be the best choice depending on QUESTION

8 Arrays vs RNAseq (1)
Correlation of fold change between arrays and RNAseq is similar to correlation between array platforms (0.73)
Technical replicates almost identical
Extra analysis: prediction of alternative splicing, SNPs
Low- and high-expressed genes do not match
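To make "correlation of fold change" concrete, here is a small sketch of how such a number could be computed from two platforms' measurements of the same genes. It assumes a Pearson correlation on log2 fold changes and a pseudocount of 1 to avoid division by zero; the slides do not state which correlation measure or transformation was used.

```python
import math
from typing import Sequence


def log2_fold_change(control: float, treated: float, pseudocount: float = 1.0) -> float:
    """log2(treated / control), with a pseudocount (an assumption) to handle zeros."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

In use, one list would hold the per-gene log2 fold changes from the array and the other the fold changes for the same genes from RNAseq.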

9 RNA-Seq promises/pitfalls
Can reveal in a single assay:
new genes
splice variants
genome-wide gene expression levels
BUT
Data is voluminous and complex
Need scalable, fast and mathematically principled analysis software and LOTS of computing resources

10 Experimental considerations
Comparative conditions must make biological sense
Biological replicates are always better than technical ones
Aim for at least 3 replicates per condition
ISOLATE the target mRNA species you are after

11 Analysis strategies
De novo assembly of transcripts:
+ reconstructs actual spliced transcripts
+ does not require genome sequence
+ easier to work with post-transcriptional modifications
- requires huge computational resources (RAM)
- low sensitivity: hard to capture low abundance transcripts
Alignment to the genome => Transcript assembly:
+ computationally feasible
+ high sensitivity
+ easier to annotate using genomic annotations
- need to take special care of splice junctions

12 Basic analysis flowchart
Illumina reads
Remove artifacts (AAA..., ...N...)
Clip adapters (small RNA)
"Collapse" identical reads
Align to the genome -> mapped / un-mapped
Pre-filter: low complexity, synthetic -> count and discard
Re-align with different number of mismatches etc -> mapped / un-mapped
Assemble: contigs (exons) + connectivity
Filter out low confidence contigs (singletons)
Annotate
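The first three boxes of the flowchart (remove artifacts, clip adapters, collapse identical reads) can be sketched as follows. This is a simplified illustration using exact string matching only; the function names, the minimum adapter overlap of 5 and the minimum read length of 15 are assumptions, and tools such as the Fastx toolkit also handle mismatches and quality values.

```python
from collections import Counter
from typing import Dict, Iterable


def is_artifact(seq: str, max_n: int = 0) -> bool:
    """Flag empty reads, homopolymers (AAAA...) and reads with too many N calls."""
    if not seq or len(set(seq)) == 1:
        return True
    return seq.count("N") > max_n


def clip_adapter(seq: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove a 3' adapter: a full match anywhere, or a partial match at the read end."""
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Try progressively shorter adapter prefixes hanging off the end of the read.
    for k in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq


def collapse(reads: Iterable[str]) -> Dict[str, int]:
    """'Collapse' identical reads into sequence -> copy number."""
    return dict(Counter(reads))


def preprocess(reads: Iterable[str], adapter: str, min_length: int = 15) -> Dict[str, int]:
    """Artifact removal, adapter clipping and collapsing, in flowchart order."""
    kept = []
    for seq in reads:
        if is_artifact(seq):
            continue
        clipped = clip_adapter(seq, adapter)
        if len(clipped) >= min_length:
            kept.append(clipped)
    return collapse(kept)
```

Collapsing before alignment means each distinct sequence is aligned once, with its copy number carried along for later counting.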

13 Software
Short-read aligners: BWA, Novoalign, Bowtie, TOPHAT (eukaryotes)
Data preprocessing: Fastx toolkit, samtools
Expression studies: Cufflinks package, R packages (DESeq, edgeR, more…)
Alternative splicing: Cufflinks, Augustus

14 The ‘Tuxedo’ protocol
TOPHAT + CUFFLINKS
TopHat aligns reads to genome and discovers splice sites
Cufflinks predicts transcripts present in dataset
Cuffdiff identifies differential expression
Very widely adopted suite
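As a rough sketch of how the Tuxedo steps chain together, the function below only builds the command lines (it does not run them). It assumes single-end reads and uses only the common `-p`, `-o`, `-g`, `-s` and `-L` options; all file and directory names are hypothetical, and the option set should be checked against the installed versions.

```python
from typing import Dict, List


def tuxedo_commands(
    bowtie_index: str,
    fastq_by_condition: Dict[str, List[str]],
    annotation_gtf: str,
    genome_fasta: str,
    threads: int = 4,
) -> List[List[str]]:
    """Build TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff command lines."""
    commands: List[List[str]] = []
    bams_by_condition: Dict[str, List[str]] = {}
    for condition, fastqs in fastq_by_condition.items():
        bams_by_condition[condition] = []
        for i, fastq in enumerate(fastqs, start=1):
            sample = f"{condition}_R{i}"  # hypothetical naming scheme
            # 1. Align reads to the genome and discover splice sites.
            commands.append(["tophat", "-p", str(threads),
                             "-o", f"{sample}_thout", bowtie_index, fastq])
            bam = f"{sample}_thout/accepted_hits.bam"
            # 2. Assemble transcripts for this sample.
            commands.append(["cufflinks", "-p", str(threads),
                             "-o", f"{sample}_clout", bam])
            bams_by_condition[condition].append(bam)
    # 3. Merge assemblies; assemblies.txt (hypothetical name) must list each
    #    <sample>_clout/transcripts.gtf, one path per line.
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
                     "-p", str(threads), "-o", "merged_asm", "assemblies.txt"])
    # 4. Test for differential expression; replicates are comma-separated.
    commands.append(["cuffdiff", "-p", str(threads), "-o", "diff_out",
                     "-L", ",".join(bams_by_condition),
                     "merged_asm/merged.gtf",
                     *[",".join(bams) for bams in bams_by_condition.values()]])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the assemblies list has been written.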

15

16 Read alignment with TopHat
Uses BOWTIE aligner to align reads to genome
BOWTIE cannot deal with large gaps (introns)
Tophat segments reads that remain unaligned
Smaller segments mostly end up aligning

17 Read alignment with TopHat (2)

18 Read alignment with TopHat (3)
When there is a large gap between segments of same read -> probable INTRON
Tophat uses this to build an index of probable splice sites
Allows accurate measurement of spliceform expression
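The segment-and-gap idea can be illustrated with a toy example. This is not TopHat's actual algorithm: it uses exact first-match lookup on a plain string instead of a Bowtie index, and the segment length and minimum intron size are arbitrary assumptions. A read is cut into segments, each segment is located on the genome, and a pair of located segments that sit further apart than they do within the read marks a probable intron.

```python
from typing import List, Optional, Tuple


def split_segments(read: str, seg_len: int) -> List[str]:
    """Cut a read into consecutive segments of seg_len bases."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]


def locate(genome: str, segment: str) -> Optional[int]:
    """Exact, first-match lookup; a stand-in for a real short-read aligner."""
    pos = genome.find(segment)
    return pos if pos != -1 else None


def probable_introns(
    genome: str, read: str, seg_len: int = 8, min_intron: int = 10
) -> List[Tuple[int, int, int]]:
    """Return (window_start, window_end, intron_length) for each large gap.

    The intron lies somewhere inside [window_start, window_end).
    """
    hits = []
    for index, segment in enumerate(split_segments(read, seg_len)):
        pos = locate(genome, segment)
        if pos is not None:  # segments spanning a junction fail to align
            hits.append((index, pos))
    gaps = []
    for (i1, p1), (i2, p2) in zip(hits, hits[1:]):
        expected = (i2 - i1) * seg_len  # spacing if the read were contiguous
        extra = (p2 - p1) - expected    # additional genomic bases in between
        if extra >= min_intron:
            gaps.append((p1 + seg_len, p2, extra))
    return gaps
```

In the toy case of a read covering the end of one exon and the start of the next, the middle segment spans the junction and stays unaligned, while the flanking segments align on either side of the intron.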

19 Cufflinks package
http://cufflinks.cbcb.umd.edu/
Cufflinks: Expression values calculation; Transcripts de novo assembly
Cuffdiff: Differential expression analysis

20 Cufflinks: Transcript assembly
Assembles individual transcripts based on aligned reads
Infers likely spliceforms of each gene
Quantifies expression level of each

21 Cuffmerge
Merges transfrags into transcripts where appropriate
Also performs a reference based assembly of transcripts using known transcripts
Produces single annotation file which aids downstream analysis

22 Cuffdiff: Differential expression
Calculates expression level in two or more samples
Expression level relates to read abundance
Because of bias sources, cuffdiff tries to model the variance in its significance calculation

23 FPKM (RPKM): Expression Values
Fragments (Reads) Per Kilobase of exon model per Million mapped fragments (reads)
Mortazavi A et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008.
C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exons in base pairs
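With the symbols defined above, the RPKM definition from Mortazavi et al. is RPKM = 10^9 * C / (N * L): the factor 10^9 combines "per kilobase" (10^3) with "per million reads" (10^6). A short sketch of the calculation follows; the function and parameter names are illustrative.

```python
def rpkm(reads_on_exons: int, total_reads: int, exon_length_bp: int) -> float:
    """RPKM = 1e9 * C / (N * L).

    C = reads mapped onto the gene's exons
    N = total number of reads in the experiment
    L = summed exon length in base pairs
    """
    if total_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("total_reads and exon_length_bp must be positive")
    return 1e9 * reads_on_exons / (total_reads * exon_length_bp)
```

For example, 500 reads on a gene with 2,000 bp of exons in an experiment of 10 million reads gives 10^9 * 500 / (10^7 * 2000) = 25. FPKM is computed the same way but counts fragments (read pairs) instead of individual reads.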

24 Cuffdiff (differential expression)
Pairwise or time series comparison
Normal distribution of read counts
Fisher’s test
test_id gene locus sample_1 sample_2 status value_1 value_2 ln(fold_change) test_stat p_value significant
ENSG TSPAN6 chrX: q1 q2 NOTEST no
ENSG TNMD chrX: q1 q2 NOTEST no
ENSG DPM1 chr20: q1 q2 NOTEST no
ENSG SCYL3 chr1: q1 q2 OK yes
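A table with these columns is straightforward to filter in a few lines. The sketch below assumes the output is tab-delimited with a header row using the column names shown above; newer Cuffdiff versions name some columns differently, so the header of the actual file should be checked first.

```python
import csv
from typing import Dict, List, TextIO


def significant_genes(handle: TextIO) -> List[Dict[str, str]]:
    """Return rows that were successfully tested and flagged as significant."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row for row in reader
        # Rows with status NOTEST were not tested (e.g. too few reads).
        if row["status"] == "OK" and row["significant"] == "yes"
    ]
```

A typical call would open the gene-level differential expression file written by Cuffdiff (assumed here to be `gene_exp.diff` in the output directory) and pass the file handle to `significant_genes`.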

25 Recommendations
You can use BOWTIE or BOWTIE2, but use CUFFDIFF2:
Better statistical model
Detection of truly differentially expressed genes
VERY easy to parse output file (See example on course page)

