6. Data upload review
Our data are RNA-Seq reads from H1 human embryonic stem cells, generated by the Caltech ENCODE project. They are single-end Illumina reads.
7. Typical RNA-Seq Project Work Flow
[Diagram stages: Tissue Sample, Total RNA, mRNA, cDNA, FASTQ file, Sequencing, QC, Visualization, TopHat, Cufflinks, Gene/Transcript/Exon Expression, Statistical Analysis]
JAX Computational Sciences Service
8. Prior to alignment, perform some quality control (QC) assessments of the data. Here we use FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
9. FastQC provides a wide range of QC checks
Here we will look only at “Per base sequence quality”.
10. Sequence quality per base position
Good data:
- Consistent
- High quality along the reads
Bad data:
- High variance
- Quality decreases with length
Reading the plot:
- The central red line is the median value
- The yellow box represents the inter-quartile range (25-75%)
- The upper and lower whiskers represent the 10% and 90% points
- The blue line represents the mean quality
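The statistics drawn in the plot can be sketched in a few lines of Python. This is only an illustration, not FastQC's own code: it assumes Phred+33 quality encoding (check your data, older Illumina files may use a different offset), and the function names `phred33_scores`, `percentile`, and `per_base_summary` are made up for this example.

```python
import statistics

def phred33_scores(quality_line):
    """Decode a FASTQ quality string, ASSUMING Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_line]

def percentile(sorted_values, fraction):
    """Simple index-based percentile on a sorted list (an approximation)."""
    index = round(fraction * (len(sorted_values) - 1))
    return sorted_values[index]

def per_base_summary(quality_lines):
    """Summarise quality scores at each read position across all reads."""
    decoded = [phred33_scores(q) for q in quality_lines]
    longest = max(len(scores) for scores in decoded)
    summary = []
    for pos in range(longest):
        # Collect the scores of every read that is long enough to cover pos.
        column = sorted(s[pos] for s in decoded if pos < len(s))
        summary.append({
            "position": pos + 1,                   # 1-based, as in the plot
            "mean": statistics.mean(column),       # blue line
            "median": statistics.median(column),   # red line
            "q25": percentile(column, 0.25),       # bottom of yellow box
            "q75": percentile(column, 0.75),       # top of yellow box
            "p10": percentile(column, 0.10),       # lower whisker
            "p90": percentile(column, 0.90),       # upper whisker
        })
    return summary

# Three invented quality strings whose quality drops toward the 3' end.
for row in per_base_summary(["IIII", "II5!", "I5!!"]):
    print(row)
```

In "bad" data the medians fall and the box widens as the position increases, which is the pattern to look for in your own plot.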
11. Our data…
[Figure: per base sequence quality for our data. x-axis: position along sequencing read; y-axis: quality score.]
12. Galaxy has several tools for trimming sequences, removing adapters, etc. prior to alignment. Using the information from FastQC, let’s trim our input sequences so that the aggregate quality score is at least 15.
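To make the idea of trimming to an aggregate score concrete, here is a minimal sketch. It is not the Galaxy tool's implementation: it assumes trimming from the 3' end only, uses the mean of a trailing window as the aggregate, and the function name `trim_3prime` and its parameters are hypothetical.

```python
def trim_3prime(sequence, qualities, threshold=15, window=1):
    """Drop bases from the 3' end until the mean quality of the
    trailing window is at least `threshold`.

    sequence  : read bases as a string
    qualities : list of integer Phred scores, one per base
    """
    end = len(qualities)
    while end > 0:
        start = max(0, end - window)
        tail = qualities[start:end]
        if sum(tail) / len(tail) >= threshold:
            break          # the remaining read ends in acceptable quality
        end -= 1           # otherwise remove one more base and re-check
    return sequence[:end], qualities[:end]

# Invented read whose quality decays toward the 3' end.
seq = "ACGTACGT"
quals = [30, 30, 30, 30, 20, 14, 10, 2]
print(trim_3prime(seq, quals))            # single-base window
print(trim_3prime(seq, quals, window=3))  # three-base window
```

A larger window tolerates a single poor base surrounded by good ones, so it usually keeps slightly longer reads than a one-base window.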
14. TopHat
http://tophat.cbcb.umd.edu/
TopHat is better suited to aligning RNA-Seq data than aligners such as Maq or BWA because it takes splicing into account during the alignment process.
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:
Trapnell et al. (2009). Bioinformatics 25:
15. Setting parameters for TopHat in Galaxy
Be sure to use the quality-trimmed sequences!
16. Does it seem like your Galaxy jobs never finish?!
Galaxy is increasingly popular, so computationally expensive processes can take time to run. Don’t restart your job or you will go to the end of the line!
Your job will continue to run on the Galaxy servers even if you shut down your computer.
17. For now we have pre-computed data to illustrate the main points!
18. Visualizing alignments in Galaxy
When TopHat finishes, the alignments are available in BAM format.
19. You can look at the alignments in a variety of browsers. Which browser you choose is a matter of personal preference.
20. UCSC Browser
Galaxy automatically creates the track and its title for you. UCSC also has controls to let you display many other kinds of annotations as tracks.
Region shown: chr19:2,373,346-2,398,357
27. Cufflinks
- Assembles transcripts,
- Estimates their abundances, and
- Tests for differential expression and regulation in RNA-Seq samples
Trapnell et al. (2010). Nature Biotechnology 28:
28. There are several ways to generate annotation files for Cufflinks to use. Here we will create an annotation file using the UCSC genome browser tool in Galaxy.
[Screenshots: panels A and B]
There are many options for the features to include in the annotation file.
Cufflinks expects the annotation in GTF format.
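If you have not seen a GTF file before: each line has nine tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, attributes). The sketch below parses one line to show the layout. The example line is invented for illustration (it is not taken from the UCSC download), and `parse_gtf_line` is a hypothetical helper, not part of Galaxy or Cufflinks. It assumes attribute values do not contain semicolons.

```python
def parse_gtf_line(line):
    """Split one GTF line into its nine columns and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    # Attributes look like: gene_id "X"; transcript_id "Y";
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),   # 1-based, inclusive
        "end": int(end),       # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }

# Invented example line; the gene and transcript names are placeholders.
EXAMPLE_LINE = (
    "chr19\texample_source\texon\t2373346\t2373500\t.\t+\t.\t"
    'gene_id "EXAMPLE_GENE"; transcript_id "EXAMPLE_TX";'
)
print(parse_gtf_line(EXAMPLE_LINE))
```

The `gene_id` and `transcript_id` attributes tell downstream tools which exons belong to which transcript and gene.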
29. Once you have selected your annotations, you can send them directly to your history in Galaxy.
30. Setting parameters for Cufflinks
Use the reference annotations you just downloaded…
31. Example of an RNA-Seq data set in NCBI’s Gene Expression Omnibus (GEO)
You don’t always need the raw sequences to do RNA-Seq analysis; you can start with a SAM or BAM file.
32. SAM files need to be converted into BAM format in order to run Cufflinks. There’s a tool in Galaxy for that!
33. Cufflinks output can be downloaded and viewed in Excel.
34. RPKM vs FPKM
Reads Per Kilobase of transcript per Million mapped reads (RPKM)
- Used for single-end sequencing reads
- Count the number of uniquely mappable reads to a set of exons that constitute a gene prediction/model.
Fragments Per Kilobase of exon per Million fragments mapped (FPKM)
- Used for paired-end sequence data
- FPKM is an estimate of the number of reads per transcript
- TopHat aligns reads to the genome
- Cufflinks assembles reads into transcript models/fragments
- Cufflinks counts the number of reads per fragment to estimate FPKM
- FPKM is used as an indication of expression level for a gene
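The arithmetic behind both measures follows directly from their names: divide the count by the feature length in kilobases and by the number of mapped reads (or fragments) in millions. The sketch below shows only this textbook calculation. Cufflinks estimates FPKM with a statistical model, so its reported values will not be reproduced by this simple formula; the function names and the numbers are illustrative.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    read_count         : reads mapped to the gene's exons
    exon_length_bp     : summed exon length of the gene model, in bases
    total_mapped_reads : all mapped reads in the sample
    """
    # (count / (length / 1e3)) / (total / 1e6) == count * 1e9 / (length * total)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """Same normalisation, counting fragments (read pairs) instead of reads."""
    return rpkm(fragment_count, exon_length_bp, total_mapped_fragments)

# Invented numbers: 500 reads on a 2,000 bp gene model, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Because both length and sequencing depth are divided out, values can be compared between genes of different lengths and between samples sequenced to different depths.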
35. Quantification of gene expression using RNA-Seq can be complicated by reads that don’t map uniquely to the genome. RNA-Seq by Expectation Maximization (RSEM) takes mapping uncertainty into account when estimating expression levels.
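The expectation-maximization idea can be illustrated with a toy example. This is not RSEM's model, which also accounts for transcript length, read errors, and more. It only shows the core loop: share each ambiguous read among its candidate transcripts in proportion to the current abundance estimates, then re-estimate the abundances from those shares. The function name `em_abundances` and the read data are invented.

```python
def em_abundances(read_compat, n_transcripts, iterations=100):
    """Toy EM for relative transcript abundance.

    read_compat   : for each read, the list of transcript indices it maps to
    n_transcripts : number of transcripts
    Returns the estimated fraction of reads originating from each transcript.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from equal abundances
    for _ in range(iterations):
        counts = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: new abundances are the normalised expected counts.
        theta = [c / len(read_compat) for c in counts]
    return theta

# Invented data: 3 reads unique to transcript 0, 1 unique to transcript 1,
# and 4 reads that map equally well to both.
reads = [[0]] * 3 + [[1]] + [[0, 1]] * 4
print(em_abundances(reads, 2))
```

Here the unique reads favour transcript 0 three to one, so the ambiguous reads end up shared in roughly the same ratio, rather than being discarded or split evenly.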
37. Differential Gene Expression
For RNA-Seq data from multiple conditions, Cuffdiff can be used to detect significant differences in transcript expression.
Is the abundance of transcripts different between two samples?
38. Is there a difference in total expression of a given gene due to treatment conditions?
- edgeR
- DESeq
- baySeq
39. Summing Up
- Alignments, Assemblies, and Annotations are essential to using Next Gen sequence data for biological investigation
- Know the strengths and weaknesses of each
- Have Fun! But Be Careful!
- Don’t just go along for the ride!
40. Tutorial Web Site
This site will be accessible after the meeting. Check back for updates and new tutorials.
41. SEQanswers is a very active public discussion board on sequence analysis issues.