An Introduction to RNA-Seq Transcriptome Profiling with iPlant

Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis is aligning the reads (in Sanger FASTQ format) for all replicates against the reference genome. At the beginning of the lecture, make sure everyone has loaded the four replicates into the new TopHat implementation (TopHat-1.4.1), which accepts multiple FASTQ files and runs them serially. This step takes the most time, but it will finish for most people during the lecture.

RNA-seq in the Discovery Environment Overview: This training module is designed to demonstrate a workflow in the iPlant Discovery Environment using RNA-Seq for transcriptome profiling. Question: How can we compare gene expression levels using RNA-Seq data in Arabidopsis WT and hy5 genetic backgrounds?

Scientific Objective LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF). Mutations cause aberrant phenotypes in Arabidopsis morphology, pigmentation and hormonal response. We will use RNA-seq to compare WT and hy5 to identify HY5-regulated genes. Source: http://www.gla.ac.uk/media/media_73736_en.jpg

Samples Experimental data were downloaded from the NCBI Short Read Archive (GEO:GSM613465 and GEO:GSM613466): two replicates each of RNA-seq runs for wild-type and hy5 mutant seedlings.

RNA-Seq Conceptual Overview This is a quick visual overview of transcriptome profiling via RNA-seq. It does not go into comparisons but we cover that with CuffDiff later. Image source: http://www.bgisequence.com

RNA-seq Sample Read Statistics Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/). Reads retained by TopHat are shown below.
Sequence run    WT-1         WT-2         hy5-1        hy5-2
Reads           10,866,702   10,276,268   13,410,011   12,471,462
Seq. (Mbase)    445.5        421.3        549.8        511.3
These are the read counts generated by TopHat as part of its alignment analysis. This is a modestly sized data set by NGS standards; it is a good time to mention scalability, the Data Store, etc.
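As a quick sanity check on the table, the sequence totals follow from the read counts if every retained read is 41 nt long. The 41 nt read length is an assumption taken from the `length=41` field in the FASTQ headers of this data set, and `mbase` is a hypothetical helper name used only for this sketch; it is not part of TopHat or samtools.

```shell
# Megabases sequenced = reads * read_length / 1e6, rounded to one decimal.
# Assumption: all reads have the same length (41 nt here).
mbase() {
  awk -v reads="$1" -v len="$2" 'BEGIN { printf "%.1f\n", reads * len / 1e6 }'
}

mbase 10866702 41   # WT-1  -> 445.5
mbase 10276268 41   # WT-2  -> 421.3
mbase 13410011 41   # hy5-1 -> 549.8
mbase 12471462 41   # hy5-2 -> 511.3
```

The four results match the Seq. (Mbase) row, which suggests the table was computed as read count times read length.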

RNA-Seq Data …Now What?
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
@SRR070570.32 HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:3B?
@SRR070570.40 HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
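Each FASTQ record spans four lines: an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence. A minimal sketch of a first "now what?" step is to count the reads and bases in a file. It assumes strictly four-line records (no wrapped sequences); `fastq_stats` is a hypothetical helper name.

```shell
# Summarize a FASTQ stream read from stdin.
# Prints three numbers: reads, total bases, and the number of records
# whose quality string length differs from the sequence length.
# Assumption: every record is exactly four lines long.
fastq_stats() {
  awk '
    NR % 4 == 2 { reads++; bases += length($0); seqlen = length($0) }
    NR % 4 == 0 { if (length($0) != seqlen) bad++ }
    END { print reads + 0, bases + 0, bad + 0 }'
}

# Example with two of the records from the slide.
fastq_stats <<'EOF'
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
EOF
```

Counting by line position rather than by a leading `@` matters here, because quality strings can themselves start with `@`, as the second record on the slide does.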

[Slide: the same FASTQ records, labeled "Bioinformatician"]

The Tuxedo Protocol

Your RNA-Seq Data
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
    ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam \
    ./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
Your transformed RNA-Seq Data

RNA-Seq Analysis Workflow
Your Data (FASTQ) -> iPlant Data Store
Discovery Environment: TopHat (Bowtie) -> Cufflinks -> Cuffmerge -> Cuffdiff
Atmosphere: CummeRbund

The iPlant Discovery Environment

Getting the RNA-Seq Data Import SRA data from the NCBI SRA. Extract FASTQ files from the downloaded SRA archives. These steps were done in advance so that the workflow fits into the module's time allocation. Spend a moment explaining the provenance (i.e., the data came from NCBI in SRA-lite format). Explain that the fastq dumper rescales the quality scores to the Sanger convention for FASTQ. Let them know we did this for them in advance.
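In the Sanger convention, each quality character encodes a Phred score as its ASCII code minus 33. The sketch below decodes a quality string under that assumption; `phred_scores` is a hypothetical helper name and is not part of the SRA tools.

```shell
# Decode a Sanger-encoded (Phred+33) quality string into numeric scores.
# Assumption: the input uses the Phred+33 offset, as Sanger FASTQ does.
phred_scores() {
  printf '%s\n' "$1" | awk '
    BEGIN {
      # Build a character-to-ASCII-code lookup for printable characters.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    {
      out = ""
      for (j = 1; j <= length($0); j++) {
        q = ord[substr($0, j, 1)] - 33
        out = out (j > 1 ? " " : "") q
      }
      print out
    }'
}

phred_scores 'BA?39#'   # -> 33 32 30 18 24 2
```

The trailing `#` characters in the example reads therefore decode to a score of 2, the lowest quality that appears in these records.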

Staged Data

Examining Data Quality with FastQC

TopHat Explain reference-sequence-based NGS read alignments. Explain that we are skipping the Cufflinks step because the Arabidopsis transcriptome is so well annotated that we can use the TAIR gene models as our reference transcripts for CuffDiff.

TopHat in the Discovery Environment

Align Reads to the Genome Align the four FASTQ files to the Arabidopsis genome using TopHat. They will have done this part by now.
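Outside the Discovery Environment, the same alignment step could be scripted as a loop over the samples. The sketch below is a dry run that only prints the commands. It reuses the `-p`, `-G`, and `-o` options from the Tuxedo protocol commands shown earlier; the sample names, the `.fq` file names, `genes.gtf`, the index prefix `genome`, and the single-end layout are placeholders assumed for illustration.

```shell
# Print one TopHat command per sample (dry run; nothing is executed).
# Assumptions: single-end reads in <sample>.fq, a Bowtie index named
# "genome", and an annotation file named genes.gtf.
tophat_cmds() {
  for sample in "$@"; do
    printf 'tophat -p 8 -G genes.gtf -o %s_thout genome %s.fq\n' "$sample" "$sample"
  done
}

tophat_cmds WT-1 WT-2 hy5-1 hy5-2
```

Piping the printed lines to `sh` would run them, provided TopHat is installed and the placeholder files exist.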

TopHat TopHat is one of many applications for aligning short sequence reads to a reference genome. It uses the Bowtie aligner internally. Alternatives include BWA, MAQ, OLego, Stampy, Novoalign, etc. Emphasize that the TopHat aligner is one of many choices. Let them know that others are available in the DE and that they can also integrate their own if they want to.

Explain this figure: The gene on the left, ATG44120 (12S seed storage protein), is differentially expressed; it is significantly down-regulated in the hy5 mutant background (> 9-fold, p=0). Compare it to the gene on the right, which is not differentially expressed between the two samples.

Assembling the Transcripts

Cufflinks in the Discovery Environment

Cufflinks Explain that there are various text manipulation tools integrated into the DE (grep, cut, awk, etc.) for highly configurable, modular analysis of the tabular output data from CuffDiff. Then segue into the Filter_CuffDiff_Results app, which consolidates some of these steps.

Merging the Transcriptomes

Cuffmerge in the Discovery Environment

Cuffmerge
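On the command line, Cuffmerge takes a plain-text file that lists one assembly GTF per line; Cufflinks writes each assembly as `transcripts.gtf` inside its output directory. The sketch below prints such a list. The directory names are placeholders, and `list_assemblies` is a hypothetical helper name.

```shell
# Print one transcripts.gtf path per Cufflinks output directory.
# Assumption: each directory was produced by "cufflinks -o <dir>".
list_assemblies() {
  for dir in "$@"; do
    printf './%s/transcripts.gtf\n' "$dir"
  done
}

list_assemblies WT-1_clout WT-2_clout hy5-1_clout hy5-2_clout
```

Redirecting the output to a file such as `assemblies.txt` gives the list that the `cuffmerge` command in the Tuxedo protocol expects as its final argument.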

Comparing wild-type to hy5 transcriptomes

Cuffdiff in the Discovery Environment Introducing CuffDiff-1.3.0 with replicates

Cuffdiff

Cuffdiff Results

Differentially expressed genes Example filtered CuffDiff results, generated with the Filter_CuffDiff_Results app to: select genes with a minimum two-fold expression difference; select genes with significant differential expression (q <= 0.05); add gene descriptions.
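The first two filtering criteria can be approximated with awk on a tab-separated CuffDiff table. This is a stand-in sketch, not the implementation of the Filter_CuffDiff_Results app. It assumes the table has a header row with columns named `log2(fold_change)` and `q_value`; other CuffDiff versions may label the fold-change column differently. `filter_cuffdiff` is a hypothetical helper name.

```shell
# Keep rows with |log2 fold change| >= min_lfc and q_value <= max_q.
# Usage: filter_cuffdiff [min_lfc] [max_q] < table.tsv
# Defaults: min_lfc=1 (two-fold), max_q=0.05.
# Assumption: the header names the columns "log2(fold_change)" and "q_value".
# Non-numeric fold changes (for example infinite values) get no special handling.
filter_cuffdiff() {
  awk -F '\t' -v min_lfc="${1:-1}" -v max_q="${2:-0.05}" '
    NR == 1 {
      # Map column names to positions, then pass the header through.
      for (i = 1; i <= NF; i++) col[$i] = i
      print
      next
    }
    {
      lfc = $(col["log2(fold_change)"]) + 0
      q = $(col["q_value"]) + 0
      if (lfc < 0) lfc = -lfc
      if (lfc >= min_lfc && q <= max_q) print
    }'
}

# Example with made-up rows: only g1 passes both thresholds.
printf 'gene_id\tlog2(fold_change)\tq_value\ng1\t-3.2\t0.001\ng2\t0.5\t0.001\ng3\t2.0\t0.2\n' |
  filter_cuffdiff
```

A two-fold difference corresponds to an absolute log2 fold change of 1, which is why the default threshold is 1 rather than 2.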

Density Plot

Scatter Plot

Volcano Plot

Expression Plots

Cloud Computing with iPlant Atmosphere

Launch a Virtual Server (in the Cloud!)

You now have your very own virtual Linux server

Expression Plots: Open a terminal and launch R

Expression Plots: Demonstration