Transcriptomics – towards RNASeq – part III


Transcriptomics – towards RNASeq – part III Federico M. Giorgi – federico.giorgi@gmail.com Analisi del Genoma e Bioinformatica Corso di Laurea Specialistica in Biotecnologie delle piante e degli animali

Overview of the course
Topics: Transcription and Transcriptomics; Transcriptomics methods: Microarrays; Exercises on Microarray analysis; RNASeq; Further Applications of RNASeq; RNASeq exercises
Schedule: Day 1, 23/04/2012, Room β3; Day 2, 02/05/2012, Room 35; Day 3, 07/05/2012, Room β3

Differential Expression in RNASeq
The pipeline for performing Differential Expression of all genes between samples can be summarized in five steps:
1. Map reads onto a reference genome or transcriptome
2. Summarize: i.e. merge counts at a gene, exon or isoform level
3. Normalize read counts (optional)
4. Calculate fold change, p-values, ranking, differential expression
5. Interpret the results (Systems Biology)
Reads are filtered before mapping. Tools named on the slide: Tophat (spliced exon-first aligner), Cufflinks

Differential Expression in RNASeq
In real cases, raw reads are filtered prior to mapping:
1. Generate reads via deep sequencing (e.g. Illumina)
2. Filter reads
   Remove contaminant reads
   - Bowtie aligner vs. NCBI-BLAST sequence database
   - rNA (Vezzi et al., 2001)
   Trim reads for quality
   - Trimmomatic
   - rNA
   Correct reads by kmer-frequency*
   - Soap corrector
   - Allpaths-LG corrector (within Trinity)
3. Map reads onto a reference genome or transcriptome (Tophat)
*Seldom used in quantitative analysis. Mostly used in Transcriptome reconstruction based on unnormalized libraries
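As a rough illustration of what "trim/filter reads for quality" means, here is a minimal sketch that computes the mean Phred score of one read and compares it to a threshold of 20 (the value used later with rNA). It assumes Phred+33 encoded qualities; the quality string is hypothetical, and this is not how rNA or Trimmomatic are implemented.

```shell
#!/bin/sh
# Hypothetical quality string of a 10 bp read, assuming Phred+33 encoding:
# "I" is ASCII 73 (Q40) and "5" is ASCII 53 (Q20).
qual='IIIIIIII55'
min_mean_q=20   # threshold, same value as used in the trimming step

# od prints every byte as a decimal number; awk subtracts the +33 offset
# and averages over all bases.
mean_q=$(printf '%s' "$qual" | od -An -v -tu1 | awk '
  { for (i = 1; i <= NF; i++) { sum += $i - 33; n++ } }
  END { printf "%.1f", sum / n }')

# Keep the read only if its mean quality reaches the threshold.
if awk -v q="$mean_q" -v t="$min_mean_q" 'BEGIN { exit !(q >= t) }'; then
  verdict=keep
else
  verdict=discard
fi
echo "mean quality: $mean_q -> $verdict"
```

Real trimmers also cut low-quality ends and drop reads that become too short, which this sketch does not attempt.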

A real example: contaminants in tomato genomic sequences (Solanum lycopersicum)
BLAST: assembled genomic contigs vs. NCBI nr/nt
[Chart. Labels: Completely new transcripts, or mistakes?; Environmental samples, no species assigned; Unknown species; Possible true contaminants; Chloroplasts; Other «contaminants»: Mitochondria. Percentages shown: 80%, 6%, 4%, <0.5%, <0.1%]

Today: RNASeq exercises
"I see and I forget. I read and I remember. I do and I understand." Confucius, 551 B.C. – 479 B.C.

Exercise of today
Purpose: Differential Expression Analysis
Tools: Tophat/Cufflinks software
Alternative approaches:
- «Red» way: completely relies on the available annotation of the Transcriptome
- «Green» way: use reads to identify novel junctions and/or transcripts

Workflow
1. Raw reads (format: fastQ)
2. Trimming (rNA)
3. Trimmed reads (format: fastQ)
4. Mapping (TopHat)
5. Mappings (format: SAM)
From the mappings, there are two routes:
- «Red» way: use a transcript database to quantify transcripts
- «Green» way: use reads to identify novel junctions and/or transcripts, generate an updated transcript database, then use that database to quantify transcripts

Experimental design
Simple, treated vs. control with replicates:
Treated A, Treated B vs. Control A, Control B
Every sample has two fastQ files, because we are going to work on paired reads
This time we don't have a real dataset, as we did with microarrays, because a full RNASeq pipeline study may take days on a cluster of computers

Prepare the directory structure
Open the terminal:
    cd /home/ngs/data_crunching/RNAseq
    mkdir cuffcompare_dir
    mkdir cuffmerge_dir
    mkdir cuffdiff_red
    mkdir cuffdiff_green
    mkdir /home/ngs/data_crunching/sequences/rna/trimmed
    cd /home/ngs/data_crunching/sequences/rna

Look at the reads
    cd /home/ngs/data_crunching/sequences/rna
    head controlA_R1.fastq
How long are these reads?
How many reads do we have? Every read takes four lines in a fastQ file, so divide the line count by 4:
    wc -l controlA_R1.fastq
    echo $(( $(wc -l < controlA_R1.fastq) / 4 ))
Counting header lines with grep is only approximate, because "@" is also a valid quality character:
    grep -c "^@" controlA_R1.fastq
(Common RNASeq fastQ file: 16,337,998 reads, >2Gbytes)
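To see why the line count is the safer way to count reads, here is a small sketch on a toy fastQ file (three made-up reads, created on the fly). It assumes the usual four-lines-per-read layout with no wrapped sequence lines.

```shell
#!/bin/sh
# Toy FASTQ with 3 reads. Two quality lines contain "@", which is a legal
# quality character and is what confuses a count based on grep.
fastq=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTGGCCAA' '+' '@IIIIIII' \
  '@read3' 'GGGGCCCC' '+' 'IIII@III' > "$fastq"

# Each record spans exactly 4 lines, so reads = lines / 4.
n_reads=$(( $(wc -l < "$fastq") / 4 ))

# Naive count: every line containing "@", quality lines included.
n_grep=$(grep -c '@' "$fastq")

echo "reads by line count: $n_reads"
echo "lines containing @:  $n_grep"
rm -f "$fastq"
```

On this toy file the two counts disagree, which is the reason to prefer the line count on real data.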

Trim the reads
We will use many for loops to save time:
    for alldir in controlA controlB treatedA treatedB
    do
        echo "$alldir" is a cool directory
    done
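The same kind of loop can be used for a quick sanity check before launching long jobs. This is a minimal sketch, assuming the `_R1.fastq`/`_R2.fastq` naming used in this exercise; it runs in a temporary directory with empty placeholder files, and one file is left out on purpose.

```shell
#!/bin/sh
# Hypothetical working directory with empty placeholder FASTQ files;
# treatedB_R2.fastq is deliberately missing.
workdir=$(mktemp -d)
for f in controlA_R1 controlA_R2 controlB_R1 controlB_R2 \
         treatedA_R1 treatedA_R2 treatedB_R1
do
  : > "$workdir/$f.fastq"
done

complete=""
missing=""
for alldir in controlA controlB treatedA treatedB
do
  # A sample is usable only if both mates of the pair are present.
  if [ -f "$workdir/${alldir}_R1.fastq" ] && [ -f "$workdir/${alldir}_R2.fastq" ]; then
    complete="$complete $alldir"
  else
    missing="$missing $alldir"
  fi
done

echo "complete pairs:$complete"
echo "incomplete pairs:$missing"
rm -rf "$workdir"
```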

Trim the reads
    for alldir in controlA controlB treatedA treatedB
    do
        rNA --filter-for-assembly \
            --query1 "$alldir"_R1.fastq --query2 "$alldir"_R2.fastq \
            --threads 1 \
            --min-phred-value-CLC 20 --min-mean-phred-quality 20 --min-size 50 \
            --output trimmed/"$alldir"
    done

Index the reference GENOME
Go to the genome directory:
    cd /home/ngs/data_crunching/reference/
Create a GENOME index for the aligner (TopHat):
    bowtie-build Populus_minusculus.fa Populus_minusculus

Trapnell suite of software Trapnell, Nature Protocols, 2012


Tophat
- Aligns RNA-Seq reads to a genome
- Able to identify exon-exon splice junctions
- Built on the short read aligner Bowtie
- Split reads alignment (for reads spanning two exons)
http://tophat.cbcb.umd.edu/manual.html (or google it!)
[Figure labels: Read; Reference: Exon, Intron, Exon]

Tophat alignment
Change directory, point the annotation to the right file:
    cd /home/ngs/data_crunching/RNAseq
    annotation=/home/ngs/data_crunching/genes/Populus_minusculus.gff3
    head "$annotation"
    reference=/home/ngs/data_crunching/reference/Populus_minusculus
    head "$reference".fa
Align every read pair against the reference:
    for alldir in controlA controlB treatedA treatedB
    do
        mkdir "$alldir"
        mkdir "$alldir"/tophat
        tophat -o "$alldir"/tophat -G "$annotation" \
            --max-multihits 10 --initial-read-mismatches 1 \
            --segment-mismatches 0 --segment-length 25 \
            "$reference" \
            ../sequences/rna/trimmed/"$alldir"_1.fastq \
            ../sequences/rna/trimmed/"$alldir"_2.fastq
    done

Some Tophat options
-G: provide TopHat with an annotation file. TopHat will first map reads on the transcriptome and then map the remaining reads to the genome (exon-first spliced approach)
-g/--max-multihits: ignore reads with more than this many alignments
--initial-read-mismatches: maximum number of mismatches for read alignment
--segment-mismatches: maximum number of mismatches for split segments
--segment-length: length of each segment into which reads are split for alignment to the genome
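To get a feeling for what --segment-length does, here is a toy sketch that cuts a made-up 75 bp read into 25 bp segments with `fold`. It only illustrates the idea of splitting; it is not TopHat's own code, and the read sequence is invented.

```shell
#!/bin/sh
# Hypothetical 75 bp read (three different 25 bp stretches glued together).
read_seq='ACGTACGTACGTACGTACGTACGTATTTTTGGGGGCCCCCAAAAATTTTTGATTACAGATTACAGATTACAGATT'
segment_length=25   # same value as --segment-length in the exercise

# fold breaks the sequence every $segment_length characters, one segment per line.
segments=$(printf '%s\n' "$read_seq" | fold -w "$segment_length")
n_segments=$(( $(printf '%s\n' "$segments" | wc -l) ))
first_segment=$(printf '%s\n' "$segments" | head -n 1)

echo "$segments"
echo "number of segments: $n_segments"
```

With --segment-mismatches 0, as in the exercise, each of these segments would have to match the genome exactly.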

First approach: red way Completely relies on the available annotation of the Transcriptome

Run Cuffdiff
Cuffdiff is one of the Cufflinks sub-packages, for simple Differential Expression Analysis
    cd /home/ngs/data_crunching/RNAseq
    annotation=/home/ngs/data_crunching/genes/Populus_minusculus.gff3
    cuffdiff -o cuffdiff_red/ -L Control,Treated "$annotation" \
        controlA/tophat/accepted_hits.bam,controlB/tophat/accepted_hits.bam \
        treatedA/tophat/accepted_hits.bam,treatedB/tophat/accepted_hits.bam

Analyze results (1)
    cd cuffdiff_red
    head gene_exp.diff
There is more than one file called *.diff
Let's open the one named "isoform_exp.diff" with LibreOffice Calc
A status of "NOTEST" means that there is not enough data to test that gene. We can only work on genes with status "OK"

Analyze results (2)
How many significantly changed genes do we have?
Sort by q-value (corrected p-value)
The "significant" column has the value "yes" if the p-value of the differential expression is <0.05 after correction for multiple testing
This is the set on which you should focus for downstream analysis such as MapMan
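The same question can be answered from the terminal instead of the spreadsheet. This is a minimal sketch on a mock tab-separated table with made-up values; it assumes the file has a header row containing columns named `status` and `significant`, and looks the columns up by name rather than by position.

```shell
#!/bin/sh
# Mock table in the spirit of gene_exp.diff; all gene names and numbers
# are invented for the example.
diff_file=$(mktemp)
printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n' > "$diff_file"
printf 'g1\tg1\tgeneA\tchr1:1-100\tControl\tTreated\tOK\t10\t40\t2\t3.1\t0.0001\t0.001\tyes\n' >> "$diff_file"
printf 'g2\tg2\tgeneB\tchr1:200-300\tControl\tTreated\tOK\t12\t13\t0.1\t0.2\t0.8\t0.9\tno\n' >> "$diff_file"
printf 'g3\tg3\tgeneC\tchr2:1-100\tControl\tTreated\tNOTEST\t0\t0\t0\t0\t1\t1\tno\n' >> "$diff_file"
printf 'g4\tg4\tgeneD\tchr2:500-900\tControl\tTreated\tOK\t80\t10\t-3\t-4.2\t0.0002\t0.002\tyes\n' >> "$diff_file"

# Read the header to find the columns by name, then count tested genes
# flagged as significant.
n_sig=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["status"]) == "OK" && $(col["significant"]) == "yes" { n++ }
  END { print n + 0 }
' "$diff_file")

# Genes that could not be tested at all.
n_notest=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["status"]) == "NOTEST" { n++ }
  END { print n + 0 }
' "$diff_file")

echo "significant genes: $n_sig (not tested: $n_notest)"
rm -f "$diff_file"
```

On the real output, replace the mock file with gene_exp.diff in the cuffdiff output directory.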

Analyze results (3)
Cuffdiff/Cufflinks doesn't provide raw read counts, but calculates RPKMs
RPKM (Reads per kilobase of exon model per million mapped reads): accounts for both library size and gene length effects
To be more precise, these are called FPKMs (Fragments per kilobase of exon model per million mapped fragments), because with paired reads each pair counts as one fragment
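The FPKM definition is just a ratio, which can be checked by hand. Here is a minimal sketch with made-up numbers for one gene; Cufflinks itself estimates abundances with a statistical model, so this only shows the plain formula FPKM = fragments × 10^9 / (gene length in bp × total mapped fragments).

```shell
#!/bin/sh
# Hypothetical numbers for one gene, chosen only to make the arithmetic easy.
fragments=500              # fragments (read pairs) mapped to the gene
length_bp=2000             # summed exon length of the gene, in bp
total_fragments=10000000   # all mapped fragments in the library

# 10^9 = 10^3 (per kilobase) x 10^6 (per million fragments).
fpkm=$(awk -v f="$fragments" -v l="$length_bp" -v n="$total_fragments" \
  'BEGIN { printf "%.2f", (f * 1e9) / (l * n) }')
echo "FPKM = $fpkm"
```

Doubling the gene length or the library size halves the value, which is exactly the gene length and library size correction described above.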

Second approach: green way Use reads to identify novel junctions and/or transcripts

The green way
The first step of the green route is still the mapping of reads using TopHat (which in turn relies on Bowtie). We can use the same parameters
Now, we will exploit Cufflinks to predict new transcripts
Cufflinks takes a *.bam format file as input. The authors suggest using the output of TopHat as Cufflinks input

Run Cufflinks (1)
First, we need to index the ALIGNMENTS (Cufflinks requires indexed alignments...)
    cd /home/ngs/data_crunching/RNAseq
    for alldir in controlA controlB treatedA treatedB
    do
        samtools index "$alldir"/tophat/accepted_hits.bam
    done

Run Cufflinks (2)
And then, we run Cufflinks itself:
    annotation=/home/ngs/data_crunching/genes/Populus_minusculus.gff3
    for alldir in controlA controlB treatedA treatedB
    do
        cufflinks -g "$annotation" -o "$alldir"/cufflinks "$alldir"/tophat/accepted_hits.bam
    done
You can change this -g to -G.
-g: use the annotation to guide assembly (green way)
-G: use the annotation and do not perform assembly (if you use the gff you supplied to tophat, then using -G will be the same as following the red way)

Merge assemblies (1)
Now we are comparing not only different mappings, but also different assemblies, each produced by Cufflinks on a different set of reads
In order to compare counts from different Cufflinks assemblies/mappings, we need a further merging step
Create a manifest file that points cuffmerge to the assemblies produced by the four separate Cufflinks runs:
    cd /home/ngs/data_crunching/RNAseq
    nano manifest.txt
Fill the file with four lines:
    controlA/cufflinks/transcripts.gtf
    controlB/cufflinks/transcripts.gtf
    treatedA/cufflinks/transcripts.gtf
    treatedB/cufflinks/transcripts.gtf
Ctrl + o (save), ENTER, Ctrl + x (exit)
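If you prefer not to type the manifest by hand, the usual for loop can write it. A minimal sketch, assuming the same sample names and the `cufflinks/transcripts.gtf` paths used above; it writes into a temporary directory so it can be tried without touching the exercise folder.

```shell
#!/bin/sh
# Write the manifest in a throwaway directory (hypothetical location).
workdir=$(mktemp -d)
manifest="$workdir/manifest.txt"
: > "$manifest"   # start from an empty file

# One assembly path per line, in the same order as the samples.
for alldir in controlA controlB treatedA treatedB
do
  printf '%s\n' "$alldir/cufflinks/transcripts.gtf" >> "$manifest"
done

n_entries=$(( $(wc -l < "$manifest") ))
first_entry=$(head -n 1 "$manifest")
cat "$manifest"
rm -rf "$workdir"
```

In the exercise directory you would write to manifest.txt directly instead of the temporary file.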

Merge assemblies (2)
Simply run cuffmerge:
    cuffmerge -o cuffmerge_dir -g "$annotation" manifest.txt

Run Cuffdiff
Cuffdiff will now work as in the red way: it will just compare comparable gene counts:
    cd /home/ngs/data_crunching/RNAseq
    newannotation=cuffmerge_dir/merged.gtf
    cuffdiff -o cuffdiff_green/ -L Control,Treated "$newannotation" \
        controlA/tophat/accepted_hits.bam,controlB/tophat/accepted_hits.bam \
        treatedA/tophat/accepted_hits.bam,treatedB/tophat/accepted_hits.bam
merged.gtf is an annotation generated by cuffmerge: it's an extension of the default one

Analyze results
    cd cuffdiff_green
    head gene_exp.diff
How many significantly changed genes do we have now? Is this identical to the red way?
You can notice that NEW genomic areas, previously not known to be transcribed, were found by Cufflinks in the green way
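One way to compare the two routes is to put the significant gene names of each run in a text file and compare the lists. This is a sketch on mock lists: the names are invented, and how you extract and match names between the red and green outputs is left as an assumption.

```shell
#!/bin/sh
# Mock lists of significant gene names from the two routes (invented names).
workdir=$(mktemp -d)
printf '%s\n' geneC geneA geneB > "$workdir/red.txt"
printf '%s\n' geneA novel_locus_1 geneC > "$workdir/green.txt"

# comm needs sorted input; LC_ALL=C keeps sort and comm consistent.
LC_ALL=C sort -u "$workdir/red.txt" > "$workdir/red.sorted"
LC_ALL=C sort -u "$workdir/green.txt" > "$workdir/green.sorted"

# -12: in both lists, -13: only in the green list, -23: only in the red list.
n_shared=$(( $(LC_ALL=C comm -12 "$workdir/red.sorted" "$workdir/green.sorted" | wc -l) ))
only_green=$(LC_ALL=C comm -13 "$workdir/red.sorted" "$workdir/green.sorted")
only_red=$(LC_ALL=C comm -23 "$workdir/red.sorted" "$workdir/green.sorted")

echo "shared: $n_shared"
echo "only green: $only_green"
echo "only red: $only_red"
rm -rf "$workdir"
```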

Conclusions
You now have a list of genes which are differentially expressed between two conditions
You can assess the effects of this treatment by checking what these genes are
In this case, we used a (dummy) dataset from Populus minuscula, a little-studied plant; therefore, before using MapMan, we must assign a function to the gene sequences (both the known ones and those newly found by Cufflinks)
Tools: BLAST; Mercator (http://mapman.gabipd.org/web/guest/app/mercator)

Final slide And thanks to Fabio Marroni who provided the data!

CummeRbund is an R package to load and visualize Cuffdiff outputs
Install CummeRbund (start R; when setRepositories() asks, select «BioC Software»):
    R
    setRepositories()
    install.packages("cummeRbund")
    library("cummeRbund")
Load the cuffdiff "green way" output:
    cuff<-readCufflinks(dir="/home/ngs/data_crunching/RNAseq/cuffdiff_green/")
    cuff

Some plots describing the dataset:
    disp<-dispersionPlot(genes(cuff))
    disp

    brep<-csBoxplot(genes(cuff),replicates=T)
    brep

    dens<-csDensity(genes(cuff),replicates=T)
    dens

    s<-csScatter(genes(cuff),"Treated","Control",smooth=T)
    s

    dend<-csDendro(genes(cuff),replicates=T)
    dend

    v<-csVolcano(genes(cuff), "Treated", "Control")
    v

CummeRbund
CummeRbund can do much more...
But nothing that R itself can't do already
Coexpression between transcripts

Final slide (2)