An Introduction to RNA-Seq Transcriptome Profiling with iPlant

1 An Introduction to RNA-Seq Transcriptome Profiling with iPlant

2

3 Before we start: Align sequence reads to the reference genome
The most time-consuming part of the analysis is aligning the reads (in Sanger FASTQ format) for all replicates against the reference genome. At the beginning of the lecture, make sure everyone has loaded the four replicates into the new TopHat implementation (TopHat-1.4.1), which accepts multiple FASTQ files and runs them serially. The alignments will finish for most people while you give the lecture.

4 RNA-seq in the Discovery Environment
Overview: This training module is designed to demonstrate a workflow in the iPlant Discovery Environment using RNA-Seq for transcriptome profiling.
Question: How can we compare gene expression levels using RNA-Seq data in Arabidopsis WT and hy5 genetic backgrounds?

5 Scientific Objective
LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF). Mutations cause aberrant phenotypes in Arabidopsis morphology, pigmentation and hormonal response. We will use RNA-seq to compare WT and hy5 to identify HY5-regulated genes. Source:

6 Samples
Experimental data downloaded from the NCBI Short Read Archive (GEO:GSM and GEO:GSM613466).
Two replicates each of RNA-seq runs for wild-type and hy5 mutant seedlings.

7 RNA-Seq Conceptual Overview
This is a quick visual overview of transcriptome profiling via RNA-seq. It does not go into comparisons but we cover that with CuffDiff later. Image source:

8 RNA-seq Sample Read Statistics
Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/). Reads retained by TopHat are shown below.

Sequence run    WT-1         WT-2         hy5-1        hy5-2
Reads           10,866,702   10,276,268   13,410,011   12,471,462
Seq. (Mbase)    445.5        421.3        549.8        511.3

These are the read counts generated by TopHat as part of its alignment analysis. This is a modestly sized data set by NGS standards; this is a good time to mention scalability, the Data Store, etc.
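As a quick sanity check on the table, dividing the sequence yield by the read count should recover the read length. The following is a minimal Python sketch, assuming the Mbase figures mean "reads × read length" in units of 10^6 bases; the variable and function names are illustrative only.

```python
# Read counts and sequence yield (Mbase) as reported in the table above.
samples = {
    "WT-1": (10_866_702, 445.5),
    "WT-2": (10_276_268, 421.3),
    "hy5-1": (13_410_011, 549.8),
    "hy5-2": (12_471_462, 511.3),
}


def mean_read_length(reads: int, mbase: float) -> float:
    """Average bases per read, assuming 1 Mbase = 1e6 bases."""
    return mbase * 1e6 / reads


read_lengths = {
    name: mean_read_length(reads, mbase)
    for name, (reads, mbase) in samples.items()
}

for name, length in read_lengths.items():
    print(f"{name}: {length:.1f} bases per read")
```

Each sample works out to about 41 bases per read, which agrees with the length=41 field in the FASTQ headers shown on the next slide.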

9 RNA-Seq Data …Now What?
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
@SRR HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
@SRR HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
@SRR HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
@SRR HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
@SRR HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
@SRR HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
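For readers who have not seen FASTQ before, each read normally occupies four lines: an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence. The sketch below is one simple way to read such records in Python. The header and sequence come from the slide; the quality string is a placeholder invented for illustration, because the slide does not show quality lines. `FastqRecord` and `read_fastq` are hypothetical names, not part of any iPlant tool.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a strict four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Header and sequence are from the slide; the quality string is a placeholder.
seq = "CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC"
example = (
    "@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41\n"
    f"{seq}\n"
    "+\n"
    f"{'I' * len(seq)}\n"
)

records = list(read_fastq(io.StringIO(example)))
for record in records:
    print(record.name, len(record.sequence))
```

This parser assumes unwrapped, four-line records; multi-line FASTQ would need a more careful reader.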

10 [Figure: the FASTQ reads from slide 9, labeled "Bioinformatician"]

11

12 The Tuxedo Protocol

13 Your transformed RNA-Seq Data
Your RNA-Seq Data
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
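Because the commands above differ only in condition and replicate name, they can be generated rather than typed, which avoids copy-paste slips in the paired file names. The following Python sketch only builds and prints the command lines (a dry run); it does not execute anything. The file names and flags are taken from the slide; the contents of `assemblies.txt` (one Cufflinks `transcripts.gtf` path per line) are an assumption, and the helper names are illustrative.

```python
import shlex

CONDITIONS = ("C1", "C2")
REPLICATES = ("R1", "R2", "R3")
THREADS = "8"


def sample_names():
    """All condition/replicate combinations, e.g. 'C1_R1'."""
    return [f"{c}_{r}" for c in CONDITIONS for r in REPLICATES]


def tophat_cmd(sample):
    # Both mates of a pair must come from the same sample.
    return ["tophat", "-p", THREADS, "-G", "genes.gtf", "-o", f"{sample}_thout",
            "genome", f"{sample}_1.fq", f"{sample}_2.fq"]


def cufflinks_cmd(sample):
    return ["cufflinks", "-p", THREADS, "-o", f"{sample}_clout",
            f"{sample}_thout/accepted_hits.bam"]


def cuffmerge_cmd():
    return ["cuffmerge", "-g", "genes.gtf", "-s", "genome.fa", "-p", THREADS,
            "assemblies.txt"]


def cuffdiff_cmd():
    # One comma-separated list of BAM files per condition.
    groups = [
        ",".join(f"./{c}_{r}_thout/accepted_hits.bam" for r in REPLICATES)
        for c in CONDITIONS
    ]
    return ["cuffdiff", "-o", "diff_out", "-b", "genome.fa", "-p", THREADS,
            "-L", ",".join(CONDITIONS), "-u", "merged_asm/merged.gtf", *groups]


# Assumed manifest for cuffmerge: one assembled GTF per sample.
assemblies = [f"./{s}_clout/transcripts.gtf" for s in sample_names()]

commands = (
    [tophat_cmd(s) for s in sample_names()]
    + [cufflinks_cmd(s) for s in sample_names()]
    + [cuffmerge_cmd(), cuffdiff_cmd()]
)

for command in commands:
    print(shlex.join(command))
```

Printing first and running later makes it easy to review the full plan before committing hours of alignment time.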

14 RNA-Seq Analysis Workflow
[Workflow figure: Your Data (FASTQ), iPlant Data Store, Discovery Environment, Atmosphere. Tools in order: Tophat (bowtie), Cufflinks, Cuffmerge, Cuffdiff, CummeRbund.]

15 The iPlant Discovery Environment

16 The iPlant Discovery Environment

17 The iPlant Discovery Environment

18 The iPlant Discovery Environment

19 Getting the RNA-Seq Data
Import SRA data from NCBI SRA
Extract FASTQ files from the downloaded SRA archives
These steps were done in advance so that the workflow fits into the module's time allocation. Spend a moment explaining the provenance (i.e., getting the data from NCBI in SRA-lite format). Explain that the FASTQ dumper rescales the quality scores to the Sanger convention for FASTQ. Let them know we did this for them in advance.
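To make the quality-score point concrete: in the Sanger convention each quality character encodes a Phred score as its ASCII code minus 33. The Python sketch below decodes such a string and re-encodes a Phred+64 string (the offset used by some older Illumina pipelines) into Sanger form. The slides do not say which encoding the original data used, so the Phred+64 source is an assumption for illustration, and this is not how the FASTQ dumper itself is implemented.

```python
SANGER_OFFSET = 33        # Sanger / Phred+33 convention
ILLUMINA_64_OFFSET = 64   # older Illumina Phred+64 convention (assumed source)


def decode_quality(quality: str, offset: int = SANGER_OFFSET) -> list[int]:
    """Convert a quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def to_sanger(quality: str, source_offset: int = ILLUMINA_64_OFFSET) -> str:
    """Re-encode a quality string from source_offset into Phred+33."""
    scores = decode_quality(quality, source_offset)
    if any(score < 0 for score in scores):
        raise ValueError("quality string is not valid for the given offset")
    return "".join(chr(score + SANGER_OFFSET) for score in scores)


# 'h' is Phred 40 in Phred+64; the same score is 'I' in Sanger encoding.
converted = to_sanger("hhhh")
print(converted, decode_quality(converted))
```

Aligners need to know which convention a file uses, which is why the module standardizes on Sanger FASTQ before TopHat is run.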

20 Staged Data

21 Examining Data Quality with fastQC

22 Tophat
Explain reference-sequence-based NGS read alignments.
Explain that we are skipping the cufflinks step because the Arabidopsis transcriptome is so well annotated that we can use the TAIR gene models as our reference transcripts for CuffDiff.

23 Tophat in the Discovery Environment

24 Align Reads to the Genome
Align the four FASTQ files to the Arabidopsis genome using Tophat.
They will have done this part by now.

25 TopHat
TopHat is one of many applications for aligning short sequence reads to a reference genome. It uses the BOWTIE aligner internally. Other alternatives are BWA, MAQ, OLego, Stampy, Novoalign, etc.
Emphasize that the TopHat aligner is one of many choices. Let them know that others are available in the DE and they can also integrate their own if they want to.

26 Explain this figure: the gene on the left, ATG44120 (12S seed storage protein), is differentially expressed. It is significantly down-regulated in the hy5 mutant background (> 9-fold, p=0). Compare it to the gene on the right, which is not differentially expressed between the two samples.

27 Assembling the Transcripts
Explain that we are skipping the cufflinks step because the Arabidopsis transcriptome is so well annotated that we can use the TAIR gene models as our reference transcripts for CuffDiff.

28 Cufflinks in the Discovery Environment
Introducing CuffDiff with replicates

29 Cufflinks
Explain that there are various text manipulation tools integrated into the DE (grep, cut, awk, etc.) for very configurable, modular analysis of the tabular output data from CuffDiff. Then segue into the Filter_CuffDiff_Results App, which consolidates some of these steps.

30 Merging the Transcriptomes

31 Cuffmerge in the Discovery Environment

32 Cuffmerge

33 Comparing wild-type to hy5 transcriptomes

34 Cuffdiff in the Discovery Environment
Introducing CuffDiff with replicates

35 Cuffdiff

36 Cuffdiff Results

37 Differentially expressed genes
Example filtered CuffDiff results generated with the Filter_CuffDiff_Results App to:
Select genes with minimum two-fold expression difference
Select genes with significant differential expression (q <= 0.05)
Add gene descriptions
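The two selection criteria above can be sketched in a few lines of Python. This is not the Filter_CuffDiff_Results App's implementation; it is a minimal illustration that assumes a tab-delimited table with `gene_id`, `status`, `log2(fold_change)`, and `q_value` columns, as in CuffDiff's gene-level output. The rows below are made-up values for illustration, and the gene-description step is omitted.

```python
import csv
import io
import math


def filter_diff(rows, min_fold=2.0, max_q=0.05):
    """Keep rows with |fold change| >= min_fold and q-value <= max_q."""
    threshold = math.log2(min_fold)  # two-fold is 1.0 on the log2 scale
    kept = []
    for row in rows:
        if row["status"] != "OK":  # skip genes that were not tested
            continue
        log2_fc = float(row["log2(fold_change)"])
        q_value = float(row["q_value"])
        if abs(log2_fc) >= threshold and q_value <= max_q:
            kept.append(row)
    return kept


# Made-up example rows; real input would come from CuffDiff output.
header = ["gene_id", "sample_1", "sample_2", "status",
          "log2(fold_change)", "q_value"]
example_rows = [
    ["geneA", "WT", "hy5", "OK", "-3.32", "0.001"],   # passes both filters
    ["geneB", "WT", "hy5", "OK", "0.14", "0.80"],     # small change
    ["geneC", "WT", "hy5", "OK", "2.00", "0.20"],     # not significant
    ["geneD", "WT", "hy5", "NOTEST", "0", "1"],       # not tested
]
table = "\n".join("\t".join(fields) for fields in [header] + example_rows)

reader = csv.DictReader(io.StringIO(table), delimiter="\t")
significant = filter_diff(reader)

for row in significant:
    print(row["gene_id"], row["log2(fold_change)"], row["q_value"])
```

Filtering on the absolute log2 value keeps both up- and down-regulated genes, which matches the "minimum two-fold expression difference" wording.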

38 Density Plot

39 Scatter Plot

40 Volcano Plot

41 Expression Plots

42 Cloud Computing with iPlant Atmosphere

43 Launch a Virtual Server (in the Cloud!)

44 You now have your very own virtual Linux server

45 Expression Plots: Open a terminal and launch R

46 Expression Plots: Demonstration

