Introduction to RNAseq


NGS - Quick Recap
  Many applications: research intent determines the choice of technology platform
  High-volume data, BUT error prone
  FASTQ is the accepted standard format
  Must assess quality scores before proceeding
  'Bad' data can be rescued
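As a minimal sketch of the "assess quality scores" step: the code below reads FASTQ records and keeps reads whose mean Phred score passes a threshold. It assumes Phred+33 encoded qualities (older Illumina data may use Phred+64), and the example records, the function names and the threshold of 20 are illustrative assumptions, not part of the slides.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def mean_phred(qual, offset=33):
    """Mean Phred score of one read; offset=33 is an assumption (Phred+33)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)

def passing_reads(handle, min_mean_q=20):
    """Names of reads whose mean quality is at least min_mean_q (assumed cutoff)."""
    return [name for name, _, q in read_fastq(handle) if mean_phred(q) >= min_mean_q]

# Hypothetical example records, not real data:
# 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"

if __name__ == "__main__":
    print(passing_reads(StringIO(EXAMPLE)))
```

In practice a dedicated tool would normally be used for this step; the sketch only shows what the quality string encodes.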

The Central Dogma of Molecular Biology Reverse Transcription

RNAseq Protocols
  cDNA, not RNA, is sequenced
  Types of libraries available:
    Total RNA sequencing (not advised)
    polyA+ RNA sequencing
    Small RNA sequencing (specific size range targeted)

cDNA Synthesis

Genome-scale Applications
  Transcriptome analysis
  Identifying new transcribed regions
  Expression profiling
  Resequencing to find genetic polymorphisms: SNPs, micro-indels, CNVs
  Question: Why even bother with exome sequencing then?

What about microarrays?
  Assume we know all transcribed regions and that spliceforms are not important
  Cannot find anything novel
  BUT may be the best choice, depending on the QUESTION

Arrays vs RNAseq (1)
  Correlation of fold change between arrays and RNAseq (0.73) is similar to the correlation between array platforms
  Technical replicates are almost identical
  Extra analysis possible with RNAseq: prediction of alternative splicing, SNPs
  Genes with low and high expression do not match

RNA-Seq promises/pitfalls
  Can reveal in a single assay:
    new genes
    splice variants
    genome-wide gene expression levels
  BUT
    Data is voluminous and complex
    Needs scalable, fast and mathematically principled analysis software and LOTS of computing resources

Experimental considerations
  Comparative conditions must make biological sense
  Biological replicates are always better than technical ones
  Aim for at least 3 replicates per condition
  ISOLATE the target mRNA species you are after

Analysis strategies
  De novo assembly of transcripts:
    + reconstructs actual spliced transcripts
    + does not require a genome sequence
    + easier to work with post-transcriptional modifications
    - requires huge computational resources (RAM)
    - low sensitivity: hard to capture low-abundance transcripts
  Alignment to the genome => transcript assembly:
    + computationally feasible
    + high sensitivity
    + easier to annotate using genomic annotations
    - need to take special care of splice junctions

Basic analysis flowchart (steps listed in the order extracted from the figure)
  Illumina reads
  Remove artifacts: AAA..., ...N...
  Clip adapters (small RNA)
  "Collapse" identical reads
  Align to the genome (branches: mapped / un-mapped)
  Pre-filter: low complexity, synthetic (count and discard)
  Re-align with a different number of mismatches etc. (branches: mapped / un-mapped)
  Assemble: contigs (exons) + connectivity
  Filter out low-confidence contigs (singletons)
  Annotate
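The first three preprocessing boxes of the flowchart (remove artifacts, clip adapters, collapse identical reads) can be sketched as follows. This is a simplified illustration, not the Fastx toolkit's implementation: the thresholds, the exact-match adapter search and the adapter sequence `CTGTAGG` are assumptions made for the example.

```python
from collections import Counter

def is_artifact(seq, max_n=0, homopolymer_frac=0.9):
    """Flag reads that contain N or are dominated by one base (e.g. AAAA...).

    Both thresholds are assumptions chosen for illustration.
    """
    if not seq:
        return True
    if seq.count("N") > max_n:
        return True
    most_common_count = Counter(seq).most_common(1)[0][1]
    return most_common_count / len(seq) >= homopolymer_frac

def clip_adapter(seq, adapter):
    """Remove the adapter and everything after it, if an exact copy is present."""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]

def preprocess(reads, adapter):
    """Drop artifacts, clip adapters, then collapse identical reads into counts."""
    kept = []
    for seq in reads:
        if is_artifact(seq):
            continue
        clipped = clip_adapter(seq, adapter)
        if clipped:
            kept.append(clipped)
    return Counter(kept)

# Hypothetical reads and adapter, not real data.
READS = [
    "ACGTACGTTTGGAA",
    "ACGTACGTTTGGAA",   # identical read, collapsed with the one above
    "AAAAAAAAAAAA",     # poly-A artifact
    "ACGNACGT",         # contains N
    "ACGTACGTCTGTAGG",  # ends with the assumed adapter
]

if __name__ == "__main__":
    for seq, count in preprocess(READS, adapter="CTGTAGG").items():
        print(seq, count)
```

Collapsing before alignment means each distinct sequence is aligned once, with its count carried along for the later expression estimate.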

Software
  Short-read aligners: BWA, Novoalign, Bowtie, TopHat (eukaryotes)
  Data preprocessing: Fastx toolkit, samtools
  Expression studies: Cufflinks package, R packages (DESeq, edgeR, more…)
  Alternative splicing: Cufflinks, Augustus

The 'Tuxedo' protocol: TopHat + Cufflinks
  Very widely adopted suite
  TopHat aligns reads to the genome and discovers splice sites
  Cufflinks predicts the transcripts present in the dataset
  Cuffdiff identifies differential expression
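To make the order of the Tuxedo steps concrete, the sketch below only builds the command lines (it does not run them). All file names, the sample labels `q1`/`q2` and the output directory names are hypothetical; the flags shown (`-p`, `-G`, `-o`, `-g`, `-s`, `-L`) are the commonly documented ones, but check them against the version installed before relying on this.

```python
def tuxedo_commands(samples, bowtie_index, annotation_gtf, genome_fasta, threads=4):
    """Build argv lists for TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff.

    samples maps a condition label to a list of FASTQ files (one per replicate).
    Returns (commands, assemblies); assemblies are the GTF paths that would be
    listed, one per line, in the file given to cuffmerge.
    """
    commands, assemblies, bams = [], [], {}
    for condition, fastqs in samples.items():
        bams[condition] = []
        for rep, fastq in enumerate(fastqs, start=1):
            thout = f"{condition}_r{rep}_thout"   # hypothetical output dir names
            clout = f"{condition}_r{rep}_clout"
            bam = f"{thout}/accepted_hits.bam"
            # 1. Spliced alignment of reads to the genome
            commands.append(["tophat", "-p", str(threads), "-G", annotation_gtf,
                             "-o", thout, bowtie_index, fastq])
            # 2. Transcript assembly per sample
            commands.append(["cufflinks", "-p", str(threads), "-o", clout, bam])
            assemblies.append(f"{clout}/transcripts.gtf")
            bams[condition].append(bam)
    # 3. Merge the per-sample assemblies into one annotation
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
                     "-p", str(threads), "assemblies.txt"])
    # 4. Differential expression: replicates comma-separated, conditions as separate args
    commands.append(["cuffdiff", "-o", "diff_out", "-p", str(threads),
                     "-L", ",".join(bams), "merged_asm/merged.gtf"]
                    + [",".join(paths) for paths in bams.values()])
    return commands, assemblies

if __name__ == "__main__":
    cmds, gtfs = tuxedo_commands(
        {"q1": ["q1_rep1.fastq"], "q2": ["q2_rep1.fastq"]},
        bowtie_index="genome_index", annotation_gtf="genes.gtf",
        genome_fasta="genome.fa",
    )
    for cmd in cmds:
        print(" ".join(cmd))
```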

Read alignment with TopHat
  Uses the Bowtie aligner to align reads to the genome
  Bowtie cannot deal with large gaps (introns)
  TopHat segments the reads that remain unaligned
  The smaller segments mostly end up aligning

Read alignment with TopHat (2)

Read alignment with TopHat (3)
  When there is a large gap between segments of the same read -> probable INTRON
  TopHat uses this to build an index of probable splice sites
  Allows accurate measurement of spliceform expression
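The idea that a large gap between aligned segments of one read suggests an intron can be illustrated with a toy function. This is not TopHat's actual algorithm; the input format (segment start and length, all on one chromosome and strand) and the intron size bounds are assumptions for the example.

```python
def probable_introns(segments, min_intron=50, max_intron=500_000):
    """Return (intron_start, intron_end) pairs implied by gaps between segments.

    segments: list of (genome_start, length) tuples for consecutive segments
    of the same read, in read order. A gap between the end of one segment and
    the start of the next that falls within [min_intron, max_intron] is
    reported as a probable intron. Both bounds are assumed values.
    """
    introns = []
    for (start1, length1), (start2, _) in zip(segments, segments[1:]):
        gap_start = start1 + length1      # first base after the upstream segment
        gap = start2 - gap_start
        if min_intron <= gap <= max_intron:
            introns.append((gap_start, start2))
    return introns

if __name__ == "__main__":
    # Hypothetical read split into three 25 bp segments: the first two align
    # contiguously, the third aligns about 2 kb downstream.
    print(probable_introns([(1000, 25), (1025, 25), (3025, 25)]))
```

Collecting these gap coordinates across many reads is what gives a list of candidate splice sites against which the remaining reads can be aligned.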

Cufflinks package
  http://cufflinks.cbcb.umd.edu/
  Cufflinks: expression values calculation; de novo assembly of transcripts
  Cuffdiff: differential expression analysis

Cufflinks: Transcript assembly
  Assembles individual transcripts based on aligned reads
  Infers the likely spliceforms of each gene
  Quantifies the expression level of each

Cuffmerge
  Merges transfrags into transcripts where appropriate
  Also performs a reference-based assembly of transcripts using known transcripts
  Produces a single annotation file, which aids downstream analysis

Cuffdiff: Differential expression
  Calculates the expression level in two or more samples
  Expression level relates to read abundance
  Because there are several sources of bias, Cuffdiff models the variance in its significance calculation

FPKM (RPKM): Expression Values
  Fragments (Reads) Per Kilobase of exon model per Million mapped fragments (reads)
  Mortazavi A et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008.
  C = the number of reads mapped onto the gene's exons
  N = the total number of mapped reads in the experiment
  L = the sum of the exon lengths in base pairs
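With C, N and L as defined above, the standard RPKM definition is RPKM = 10^9 x C / (N x L): the factor 10^9 combines "per kilobase" (10^3) and "per million reads" (10^6). A minimal sketch follows; the function name and the example numbers are illustrative, not taken from the slides.

```python
def rpkm(c, n, l):
    """Reads Per Kilobase of exon model per Million mapped reads.

    c: reads mapped onto the gene's exons
    n: total mapped reads in the experiment
    l: summed exon length of the gene in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)

if __name__ == "__main__":
    # Hypothetical gene: 1,000 reads on 2,000 bp of exons, 10 million mapped reads.
    # 1e9 * 1000 / (1e7 * 2000) = 50
    print(rpkm(1000, 10_000_000, 2000))
```

FPKM uses the same formula with fragments (read pairs) counted instead of individual reads.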

Cuffdiff (differential expression)
  Pairwise or time series comparison
  Normal distribution of read counts
  Fisher's test

  test_id          gene    locus                      sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
  ENSG00000000003  TSPAN6  chrX:99883666-99894988     q1        q2        NOTEST  0        0        0                0          1         no
  ENSG00000000005  TNMD    chrX:99839798-99854882     q1        q2        NOTEST  0        0        0                0          1         no
  ENSG00000000419  DPM1    chr20:49551403-49575092    q1        q2        NOTEST  15.0775  23.8627  0.459116         -1.39556   0.162848  no
  ENSG00000000457  SCYL3   chr1:169631244-169863408   q1        q2        OK      32.5626  16.5208  -0.678541        15.8186    0         yes
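As a minimal sketch of how such output can be parsed, the code below reads the four example rows shown on the slide and keeps the genes that were tested (status OK) and called significant. It splits on whitespace for simplicity; a real Cuffdiff output file is tab-delimited, and the function names here are illustrative assumptions.

```python
# Header and rows copied from the example output on the slide.
CUFFDIFF_TEXT = """\
test_id gene locus sample_1 sample_2 status value_1 value_2 ln(fold_change) test_stat p_value significant
ENSG00000000003 TSPAN6 chrX:99883666-99894988 q1 q2 NOTEST 0 0 0 0 1 no
ENSG00000000005 TNMD chrX:99839798-99854882 q1 q2 NOTEST 0 0 0 0 1 no
ENSG00000000419 DPM1 chr20:49551403-49575092 q1 q2 NOTEST 15.0775 23.8627 0.459116 -1.39556 0.162848 no
ENSG00000000457 SCYL3 chr1:169631244-169863408 q1 q2 OK 32.5626 16.5208 -0.678541 15.8186 0 yes
"""

def parse_cuffdiff(text):
    """Parse whitespace-separated Cuffdiff-style text into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    columns = lines[0].split()
    return [dict(zip(columns, line.split())) for line in lines[1:]]

def significant_genes(records):
    """Gene names for rows that were tested (status OK) and marked significant."""
    return [r["gene"] for r in records
            if r["status"] == "OK" and r["significant"] == "yes"]

if __name__ == "__main__":
    records = parse_cuffdiff(CUFFDIFF_TEXT)
    print(significant_genes(records))
```

Rows with status NOTEST are excluded because no test was performed for them, whatever their p_value column says.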

Recommendations
  You can use Bowtie or Bowtie2, but use Cuffdiff2:
    Better statistical model
    Detection of truly differentially expressed genes
    VERY easy to parse output file (see example on course page)