The iPlant Collaborative: Community Cyberinfrastructure for Life Science
Tools and Services Workshop
RNA-Seq using the Discovery Environment and CoGe
What is RNA-Seq?
Gene-Expression studies by sequencing Reverse-Transcribed RNA
Getting Started…
First: what if you have a question?
Hint: what if you have a question about anything?
Starting an RNA-Seq Project
Sequencing platforms: Illumina, Ion Torrent, 454, PacBio
So your reads are ready… You’ve uploaded the sequencing files to the iPlant Data Store. What’s next? What are the steps for RNA-Seq?
RNA-Seq Conceptual Overview
The entire RNA-Seq analysis method…
1. Read analysis and cleanup!
2. Map the reads to the genome (if you have a genome sequence)
3. Assemble the reads into transcripts
4. Map the transcripts to the genome (if you have a genome sequence)
5. Annotate the transcripts (or wait until later)
6. Map the reads to the genome or directly to the transcripts
7. Count the number of hits per transcript or gene for each condition
8. Analyze counts for different conditions to determine differential expression
Then you start thinking more about the Gene Ontology and what types of genes or transcripts are differentially expressed: biology!!
Examining Data Quality with FastQC
RNA-Seq
HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
…Now What?
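Before moving on, it can help to sanity-check a read file. The sketch below is one way to do that, assuming the reads are stored as uncompressed FASTQ with exactly four lines per record (header, sequence, `+` separator, quality string). The function names `count_fastq_reads` and `mean_read_length` are illustrative and are not part of any iPlant tool.

```shell
#!/usr/bin/env bash
# Assumption: unwrapped FASTQ, four lines per record, not gzip-compressed.

# Print the number of reads (total lines divided by four).
count_fastq_reads() {
    local fastq="$1"
    awk 'END { print int(NR / 4) }' "$fastq"
}

# Print the mean read length; the sequence is line 2 of each 4-line record.
mean_read_length() {
    local fastq="$1"
    awk 'NR % 4 == 2 { total += length($0); n++ }
         END { if (n > 0) printf "%.1f\n", total / n }' "$fastq"
}

# Hypothetical usage, with a file named as in the TopHat examples:
#   count_fastq_reads C1_R1_1.fq
#   mean_read_length  C1_R1_1.fq
```

For a fuller quality report (per-base quality, adapter content, and so on), FastQC remains the tool to use; this only confirms that the file is intact and the read length is what you expect.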
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
    ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam \
    ./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
Your RNA-Seq Data → Your transformed RNA-Seq Data
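Typing one command per sample by hand invites copy-and-paste slips in the mate-pair file names. As a rough sketch, the per-sample TopHat and Cufflinks commands can instead be generated in a loop. This assumes the same naming scheme as the commands above (`<condition>_<replicate>_1.fq` and `_2.fq`). The function name `print_alignment_commands` is made up for illustration, and it only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Dry run: print (do not execute) the TopHat and Cufflinks command for each
# condition/replicate pair. Sample names C1/C2 and R1..R3 follow the slide.
print_alignment_commands() {
    local threads="${1:-8}"
    local cond rep sample
    for cond in C1 C2; do
        for rep in R1 R2 R3; do
            sample="${cond}_${rep}"
            # Both mates are derived from the same sample name.
            echo "tophat -p ${threads} -G genes.gtf -o ${sample}_thout genome ${sample}_1.fq ${sample}_2.fq"
            echo "cufflinks -p ${threads} -o ${sample}_clout ${sample}_thout/accepted_hits.bam"
        done
    done
}

print_alignment_commands 8
```

Once the printed commands look right, the output could be piped to `bash` to run them. In the Discovery Environment the same steps are configured through the app interface instead.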
RNA-Seq Analysis Workflow
Your Data (FASTQ, iPlant Data Store) → TopHat (Bowtie) → Cufflinks → Cuffmerge → Cuffdiff → CummeRbund
Platforms: Discovery Environment, Atmosphere
RNA-Seq Workflow Overview
TopHat
TopHat is one of many applications for aligning short sequence reads to a reference genome. It uses the Bowtie aligner internally. Alternatives include GSNAP, BWA, and Stampy.
RNA-seq Sample Read Statistics
Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/). Reads mapped by TopHat are shown below.

Sequence run    WT-1        WT-2        hy5-1       hy5-2
Reads           10,866,702  10,276,268  13,410,011  12,471,462
Seq. (Mbase)
Examining Differential Gene Expression
Input Read Files for TopHat
BAM Alignment files – for Cufflinks
GTF – Reference Based Assembly File
Inputs for Cuffdiff
Cuffdiff Output
Output directories: cuffdiff_out, sorted_data
cuffdiff_out directory
basic_plots.R
bias_params.info
cds.count_tracking
cds.diff
cds_exp.diff
cds.fpkm_tracking
cds.read_group_tracking
cuffData.db
gene_exp.diff
genes.count_tracking
genes.fpkm_tracking
genes.read_group_tracking
isoform_exp.diff
isoforms.count_tracking
isoforms.fpkm_tracking
isoforms.read_group_tracking
promoters.diff
read_groups.info
run.info
splicing.diff
tss_group_exp.diff
tss_groups.count_tracking
tss_groups.fpkm_tracking
tss_groups.read_group_tracking
var_model.info
cds.diff file
Columns: test_id, gene_id, gene, locus, sample_1, sample_2, status, value_1, value_2, log2(fold_change), test_stat, p_value, q_value, significant

test_id    gene      sample_1  sample_2     status  log2(fold_change)  test_stat  p_value  q_value  significant
AT1G01010  ANAC001   nuclear   cytoplasmic  OK      …                  …          …        …        no
AT1G01020  ARV1      nuclear   cytoplasmic  OK      …                  …          …        …        yes
AT1G01030  NGA3      nuclear   cytoplasmic  OK      …                  …          …        …        yes
AT1G01040  DCL1      nuclear   cytoplasmic  OK      …                  …          …        …        no
AT1G01046  MIR838A   nuclear   cytoplasmic  NOTEST  inf                0          1        1        no
AT1G01060  LHY       nuclear   cytoplasmic  OK      …                  …          …        …        no
AT1G…      …         nuclear   cytoplasmic  OK      …                  …          …        …        yes
AT1G…      …         nuclear   cytoplasmic  NOTEST  0                  0          1        1        no
AT1G…      …         nuclear   cytoplasmic  OK      …                  …          …        …        yes
AT1G01100  …         nuclear   cytoplasmic  OK      …                  …          …        …        yes
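To pull out only the significant rows from a Cuffdiff `.diff` file, something like the following could be used. It is a minimal sketch, assuming the file is tab-delimited with a header row containing the `status` and `significant` columns listed on this slide. The function name `significant_rows` is hypothetical.

```shell
#!/usr/bin/env bash
# Keep the header plus rows where the test ran (status OK) and Cuffdiff
# flagged the difference as significant. Columns are looked up by name in
# the header instead of by fixed position.
significant_rows() {
    local diff_file="$1"
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
        $(col["status"]) == "OK" && $(col["significant"]) == "yes"
    ' "$diff_file"
}

# Hypothetical usage:
#   significant_rows cuffdiff_out/gene_exp.diff > gene_exp.sig.diff
```

Rows marked NOTEST are dropped because Cuffdiff did not test them, so their p-values and q-values of 1 say nothing about expression.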
sorted_data directory
genes.sorted_by_expression.sig.txt
genes.sorted_by_expression.txt
genes.sorted_by_fold.sig.txt
genes.sorted_by_fold.txt
transcripts.sorted_by_expression.sig.txt
transcripts.sorted_by_expression.txt
transcripts.sorted_by_fold.sig.txt
transcripts.sorted_by_fold.txt
Sorted Differentially Expressed Genes
Columns: gene_id, gene_name, sample1, sample2, fold_change, direction, total_fpkm, q-value, gene_description

gene_id    gene_name  sample1  sample2      fold_change  direction  gene_description
ATCG00010  TRNH       nuclear  cytoplasmic  3.8600       DOWN       A chloroplast gene encoding a histidine-accepting tRNA. [Source:TAIR;Acc:ATCG00010]
ATCG00220  PSBM       nuclear  cytoplasmic  4.7900       DOWN       photosystem II reaction center protein M. [Source:TAIR;Acc:ATCG00220]
AT3G16640  TCTP       nuclear  cytoplasmic  9.3800       UP         translationally-controlled tumor protein-like protein [Source:EMBL;Acc:AEE…]
AT5G…      …          nuclear  cytoplasmic  4.6700       UP         uncharacterized protein [Source:EMBL;Acc:AED…]
ATCG00090  TRNS.1     nuclear  cytoplasmic  5.9300       DOWN       tRNA-Ser. [Source:TAIR;Acc:ATCG00090]
AT3G…      ATARFA1E   nuclear  cytoplasmic  7.6400       UP         ADP-ribosylation factor A1E [Source:EMBL;Acc:AEE…]
AT2G…      …          nuclear  cytoplasmic  6.0500       UP         S ribosomal protein L41 [Source:EMBL;Acc:AEC…]
AT4G…      …          nuclear  cytoplasmic  4.8300       DOWN       snoRNA. [Source:TAIR;Acc:AT4G39366]
AT5G…      …          nuclear  cytoplasmic  2.0200       DOWN       uncharacterized protein [Source:EMBL;Acc:AED…]
AT4G…      …          nuclear  cytoplasmic  8.6900       DOWN       snoRNA. [Source:TAIR;Acc:AT4G39364]
AT3G…      …          nuclear  cytoplasmic  4.5600       DOWN       snoRNA. [Source:TAIR;Acc:AT3G47347]
AT1G28330  DRM1       nuclear  cytoplasmic  3.0000       UP         dormancy-associated protein-like [Source:EMBL;Acc:AEE…]
AT5G03240  UBQ3       nuclear  cytoplasmic  …            UP         polyubiquitin [Source:EMBL;Acc:AED…]
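A quick way to summarize a table like this is to tally how many genes go UP versus DOWN. The sketch below assumes the sorted_data files are tab-delimited with a header row that includes a `direction` column, as on the slide; the actual file layout is not confirmed here, so check the header first. The function name `count_directions` is illustrative only.

```shell
#!/usr/bin/env bash
# Tally UP and DOWN entries in a tab-delimited table.
# Assumption: the header row contains a column named "direction".
count_directions() {
    local table="$1"
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "direction") d = i; next }
        d && $d != "" { n[$d]++ }
        END { printf "UP\t%d\nDOWN\t%d\n", n["UP"], n["DOWN"] }
    ' "$table"
}

# Hypothetical usage:
#   count_directions sorted_data/genes.sorted_by_fold.sig.txt
```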
The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI ).
ATG44120 (12S seed storage protein) is significantly down-regulated in the hy5 mutant background (> 9-fold, p=0). Compare to the gene on the right, which lacks differential expression.