The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Intro to RNA-Seq with the Tuxedo Suite
Experiment Overview Goals Determine the differential expression (relative abundance) of transcripts between a WT and a mutant organism
RNA-Seq Overview Basic concept Image source: http://www.bgisequence.com
Experiment Overview Example experiment http://www.biomedcentral.com/1471-2164/14/903
Now what?
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
@SRR070570.32 HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:3B?
@SRR070570.40 HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
Getting a feel for the data FASTQ format
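To get a feel for what each FASTQ record holds, here is a minimal Python sketch that reads four-line records and turns the quality string into Phred scores. It assumes the Sanger/Phred+33 encoding (the `#` characters in the example reads suggest this, but older Illumina pipelines used other offsets) and that every record has exactly four lines. The names `FastqRecord` and `parse_fastq` are made up for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str             # header line without the leading '@'
    sequence: str         # base calls
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text: header, sequence, '+', quality.

    Assumption: qualities are ASCII-encoded with the given offset (33 here).
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


if __name__ == "__main__":
    example = ["@read1 hypothetical", "ACGT", "+", "BA?#"]
    for record in parse_fastq(example):
        print(record.name, record.sequence, record.qualities)
```

A run of `#` at the end of a read decodes to Phred 2 under this encoding, which is why such tails are usually treated as unreliable.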
Now what? Bioinformatician
Papers and Background Read these first!
Tuxedo Workflow Differential expression *TopHat and Cufflinks require a sequenced genome
No reference genome? Resources
Standards for RNA Suggestions before you sequence http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf http://www.nature.com/nbt/journal/v32/n9/full/nbt.3025.html
Command line version Your RNA-Seq Data Your transformed RNA-Seq Data
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
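The cuffmerge step reads `assemblies.txt`, a plain text file listing one Cufflinks assembly per line, but the commands above do not show how that file is made. A minimal Python sketch follows, assuming the `C1`/`C2` condition names and `_clout` output directories used on this slide; the helper names are hypothetical.

```python
from typing import Iterable, List, Sequence


def sample_names(conditions: Sequence[str] = ("C1", "C2"), replicates: int = 3) -> List[str]:
    """Build sample names such as C1_R1 ... C2_R3 (naming taken from the slide)."""
    return [f"{cond}_R{rep}" for cond in conditions for rep in range(1, replicates + 1)]


def assemblies_manifest(samples: Iterable[str]) -> str:
    """One transcripts.gtf path per line, one per Cufflinks output directory."""
    return "".join(f"./{sample}_clout/transcripts.gtf\n" for sample in samples)


if __name__ == "__main__":
    # Redirect this output to assemblies.txt before running cuffmerge.
    print(assemblies_manifest(sample_names()), end="")
```

Generating the names in one place also avoids mixing up replicate or condition labels when the same command is repeated six times by hand.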
Using a GUI Your Data (FASTQ), iPlant Data Store, Discovery Environment, Atmosphere. Discovery Environment tools: TopHat (Bowtie), Cufflinks, Cuffmerge, Cuffdiff, CummeRbund
Moving your data in Complete documentation www.iplantc.org/ds1
iDrop Desktop Easy to use!
Discovery Environment Easy to use!
Decompress your data Know what files you have
Remove barcodes? Demultiplexing and adapter trimming Image from: http://www.westburg.eu/lp/rna-seq-library-preparation Pre-process sequences if needed (e.g., Sabre for de-multiplexing reads, and Scythe for removing primer/adapter sequences)
Quality Control FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Quality Control Per base sequence quality BAD GOOD The central red line is the median value. The yellow box represents the inter-quartile range (25-75%). The upper and lower whiskers represent the 10% and 90% points. The blue line represents the mean quality.
Quality Control Per sequence quality BAD GOOD Fail: most frequently observed mean quality is below 20 (1% error rate)
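The link between a quality of 20 and a 1% error rate comes from the Phred definition, Q = -10 log10(P). A small Python sketch of the conversion is below; the function names are illustrative, and `mean_quality` is only a rough stand-in for the per-read mean that this FastQC module plots.

```python
import math
from typing import Sequence


def phred_to_error(q: float) -> float:
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)


def mean_quality(qualities: Sequence[int]) -> float:
    """Arithmetic mean of the Phred scores of one read."""
    return sum(qualities) / len(qualities)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error(q))
```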
Quality Control Sequence length distribution GOOD Fail: error if any of the sequences have zero length.
Quality Control Overrepresented sequences Potentially BAD Fail: module will issue an error if any sequence is found to represent more than 1% of the total
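The 1% rule can be illustrated with a simple count of identical reads. This Python sketch is a simplification under stated assumptions: it counts exact duplicates across every read it is given, whereas FastQC itself works differently in detail (for example, it does not track every sequence in a large file). The function name `overrepresented` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def overrepresented(sequences: Iterable[str], threshold: float = 0.01) -> Dict[str, float]:
    """Return each sequence whose share of all reads exceeds the threshold."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        seq: n / total
        for seq, n in counts.items()
        if n / total > threshold
    }


if __name__ == "__main__":
    # 3 copies of one read among 200 reads is 1.5% of the total.
    reads = ["ACGTACGT"] * 3 + [f"UNIQUE{i}" for i in range(197)]
    print(overrepresented(reads))
```

Sequences flagged this way are often adapters, rRNA, or a few very highly expressed transcripts, which is why the slide calls the result only potentially bad.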
TopHat Maps reads to reference genome
TopHat Maps reads to reference genome TopHat is one of many applications for aligning short sequence reads to a reference genome. It uses the Bowtie aligner internally. Alternatives include BWA, MAQ, STAR, OLego, Stampy, Novoalign, etc.
TopHat Maps reads to reference genome TopHat has a number of parameters and options, and their default values are tuned for processing mammalian RNA-Seq reads. If you would like to use TopHat for another class of organism, we recommend setting some of the parameters with more strict, conservative values than their defaults. Usually, setting the maximum intron size to 4 or 5 Kb is sufficient to discover most junctions while keeping the number of false positives low. - TopHat User Manual
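As a sketch of how the manual's advice might be applied, the Python helper below builds a TopHat argument list with the maximum intron length set to 5 kb via TopHat's `--max-intron-length` option. The sample naming, `genes.gtf`, and the `genome` index name follow the command line slide; the helper name `tophat_args` and the choice of 5000 are assumptions to adjust for your organism.

```python
from typing import List


def tophat_args(sample: str, max_intron: int = 5000, threads: int = 8) -> List[str]:
    """Argument list for one paired-end TopHat run.

    Both mate files are derived from the same sample name, so the two reads
    of a pair cannot come from different samples by accident.
    """
    return [
        "tophat",
        "-p", str(threads),
        "-G", "genes.gtf",
        "--max-intron-length", str(max_intron),
        "-o", f"{sample}_thout",
        "genome",
        f"{sample}_1.fq",
        f"{sample}_2.fq",
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to execute it.
    print(" ".join(tophat_args("C1_R1")))
```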
IGV Visualize mapped reads
Cufflinks Assemble transcripts
Cufflinks Assemble transcripts Hint: Provide a mask file (gtf/gff) Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file. Include annotated rRNA, mitochondrial transcripts, and other abundant transcripts you wish to ignore. - Cufflinks User Manual
Cufflinks Assemble transcripts
1) transcripts.gtf This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are standard GTF, and the last column contains attributes, some of which are also standardized ("gene_id", and "transcript_id"). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript.
2) isoforms.fpkm_tracking This file contains the estimated isoform-level expression values (FPKM).
3) genes.fpkm_tracking This file contains the estimated gene-level expression values (FPKM).
- Cufflinks User Manual
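To look inside transcripts.gtf, a short Python sketch like the one below can split a record into its tab-separated fields and unpack the attributes column. The example line is invented for illustration (it is not real Cufflinks output), and `parse_gtf_line` is a hypothetical helper, not part of Cufflinks.

```python
import re
from typing import Dict

# Attributes look like: key "value"; key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line: str) -> Dict[str, str]:
    """Split one tab-separated GTF record and unpack its attributes column."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "seqname": fields[0],
        "feature": fields[2],  # "transcript" or "exon" in Cufflinks output
        "start": fields[3],
        "end": fields[4],
        "strand": fields[6],
    }
    # The attributes are always the last tab-separated field.
    record.update(ATTRIBUTE.findall(fields[-1]))
    return record


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = ('chr1\tCufflinks\ttranscript\t100\t900\t1000\t+\t.\t'
               'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "12.5";')
    print(parse_gtf_line(example))
```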
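Cuffmerge Assemble transcriptome from RABT and Cufflinks Cuffmerge is a meta-assembler: it merges the Cufflinks transcript assemblies and, when a reference annotation is supplied, combines them with a reference annotation based transcript (RABT) assembly.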
Cuffdiff Determine sample differences
Cuffdiff Determine sample differences
Cuffdiff evaluates variation in read counts for each gene across the replicates; this estimate is used to calculate the significance of expression changes.
Cuffdiff can identify genes that are differentially spliced or differentially regulated via promoter switching.
Isoforms of a gene that have the same TSS are grouped.
The detection rate of differentially expressed genes/transcripts is strongly dependent on sequencing depth.
Cuffdiff Determine sample differences Changes in fragment counts ≠ changes in expression. True expression is estimated by the sum of the length-normalized isoform read counts, so the entire transcript must be taken into account.
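A small worked example shows why equal fragment counts can hide an expression change. The Python sketch below uses the basic FPKM definition (fragments per kilobase of transcript per million mapped fragments) and sums it over isoforms. This is a deliberate simplification: Cuffdiff estimates isoform abundances with a statistical model rather than this direct arithmetic, and the numbers and function names here are invented.

```python
from typing import Iterable, Tuple


def fpkm(fragments: float, length_bp: float, total_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments / (length_bp / 1_000) / (total_fragments / 1_000_000)


def gene_fpkm(isoforms: Iterable[Tuple[float, float]], total_fragments: float) -> float:
    """Sum of isoform FPKMs; each isoform is (fragments, length in bp)."""
    return sum(fpkm(frags, length, total_fragments) for frags, length in isoforms)


if __name__ == "__main__":
    total = 1_000_000
    # Same gene, same 100 fragments, but a switch from a 2 kb to a 1 kb isoform.
    condition_a = [(100, 2_000)]
    condition_b = [(100, 1_000)]
    print(gene_fpkm(condition_a, total), gene_fpkm(condition_b, total))
```

In this toy case the raw count is unchanged between conditions, yet the length-normalized estimate doubles because the shorter isoform needs more molecules to yield the same number of fragments.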
Cuffdiff Determine sample differences
1) FPKM tracking files Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group. (tss_groups.fpkm_tracking tracks summed FPKM of transcripts sharing tss_ids)
2) Count tracking files Estimate of the number of fragments that originated from each transcript, primary transcript, and gene in each sample.
3) Read group tracking files Expression and fragment count for each transcript, primary transcript, and gene in each replicate.
4) Differential expression tests Tab-delimited file that lists the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences.
Plus several other outputs (diff splicing, CDS, promoter, etc.)
Cuffdiff Determine sample differences Example filtered Cuffdiff results generated in the Discovery Environment.
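Filtering of this kind can also be done outside the Discovery Environment. The Python sketch below keeps tested genes below a q-value cutoff from a tab-delimited differential expression table. It assumes column names `gene`, `status`, `log2(fold_change)`, and `q_value` as found in Cuffdiff's gene-level test file; check the header of your own file, since these names and the 0.05 cutoff are assumptions, and `significant_genes` is a hypothetical helper.

```python
import csv
import io
from typing import List, TextIO, Tuple


def significant_genes(handle: TextIO, max_q: float = 0.05) -> List[Tuple[str, float, float]]:
    """Return (gene, log2 fold change, q-value) for tested genes under the cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        # Skip genes that were not successfully tested (assumed status != "OK").
        if row["status"] != "OK":
            continue
        q_value = float(row["q_value"])
        if q_value <= max_q:
            hits.append((row["gene"], float(row["log2(fold_change)"]), q_value))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    # Invented rows, for illustration only.
    table = io.StringIO(
        "gene\tstatus\tlog2(fold_change)\tq_value\n"
        "geneA\tOK\t2.5\t0.001\n"
        "geneB\tOK\t0.1\t0.80\n"
        "geneC\tNOTEST\t0\t1\n"
    )
    print(significant_genes(table))
```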
Cuffdiff Density plot
Cuffdiff Scatter plot
Cuffdiff Volcano plot
CummeRbund Using R in Atmosphere (tomorrow)
Keep asking: ask.iplantcollaborative.org
The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI-0735191).