1 Introduction to RNA-Seq and Transcriptome Analysis
Bacterial Genome Assembly | Victor Jongeneel
Hands-on activities (Fun with UNIX!)
PowerPoint: Jessica Kirkpatrick and Casey Hanson
RNA-Seq Lab | Jessica Kirkpatrick | 2015
2 Exercise
Use the Tuxedo Suite to:
- Align RNA-Seq reads using TopHat (a splice-aware aligner).
- Perform reference-based transcriptome assembly with Cufflinks.
- Obtain a new transcriptome using Cufflinks and Cuffmerge.
- Use Cuffdiff to obtain a list of differentially expressed genes.
- Report a list of significantly differentially expressed genes.
Use a genome browser and visualization tool to observe the aligned data and the new transcriptome.
3 Tuxedo Suite
- Bowtie and Bowtie2 use Burrows-Wheeler indexing for aligning reads. With Bowtie2 there is no upper limit on the read length.
- TopHat uses either Bowtie or Bowtie2 to align reads in a splice-aware manner and aids the discovery of new splice junctions.
- The Cufflinks package has 4 components; the 2 major ones are listed below:
  - Cufflinks does reference-based transcriptome assembly.
  - Cuffdiff does statistical analysis and identifies differentially expressed transcripts in a simple pairwise comparison, or in a series of pairwise comparisons in a time-course experiment.
Trapnell et al., Nature Protocols, March 2012
4 Premise
1. Procedure:
Run 1: Allow TopHat to select splice junctions and proceed through the steps without giving the software any information about known genes/gene models.
Run 2: Force TopHat to use only known splice junctions (i.e. known genes/gene models) and proceed through the steps, making sure we are doing our analysis in the context of these gene models.
2. Evaluation:
a. 2 metrics: # of mapped reads and # of genes identified as significantly different
b. Compare new transcriptome to known genes.
Question: Is there a difference in the results if the Tuxedo Suite is run 2 different ways?
6 Input data
RNA-Seq: 100 bp, single-end data

sample      replicate #   fastq name               # reads
control     Replicate 1   thrombin_control.fastq   10,953
experiment                thrombin_expt.fastq      12,027

Genome & gene information:

name             description
chr22.fa         FASTA file with the sequence of chromosome 22 from the human genome (hg19, UCSC) (reference genome)
genes-chr22.gtf  GTF file with gene annotation, known genes (hg19, UCSC)
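The read counts in the table can be checked from the command line by counting FASTQ records. This is a minimal sketch, assuming uncompressed FASTQ files in which every record occupies exactly four lines; the function name count_fastq_reads is illustrative and not part of the workshop materials.

```shell
# count_fastq_reads FILE
# Print the number of reads in an uncompressed FASTQ file.
# Assumption: each record is exactly four lines
# (header, sequence, '+' separator, quality string).
count_fastq_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}
```

Running `count_fastq_reads thrombin_control.fastq` prints a single number that can be compared with the "# reads" column above.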
7 Sign in to Galaxy
- Go to
- Click on the button
- Sign in using your classroom ID and password
8 How Galaxy works with the biocluster
Biocluster
Signing up -
Usage and cost -
Christopher Fields
10 Accessing the input files
The data are located in the following directory:
/home/classroom/rnaseq-mayo/
The rnaseq-mayo directory contains an input_data folder as well as a results folder.
(Note: “~” is a symbol in UNIX paths referring to your home directory.)
$ mkdir ~/rnaseq-mayo
# Make a working directory in your home directory.
$ cp /home/classroom/rnaseq-mayo/input_data/* ~/rnaseq-mayo/
# Copy data to your working directory.
$ qsub -I -q classroom -l nodes=1:ppn=4
# Log in to a “classroom” computer on the cluster with 4 processors, in interactive mode.
11 Getting data into Galaxy (Method 4) Click on the “Shared Data” pulldown menu Click on “Published Histories”
15 Now your current history is the imported history, called “imported: RNA-Seq Chr 22 Data”
In the top right corner of the history panel is a wheel; click on that wheel.
16 Getting data
- The pulldown menu that is revealed when you click on the wheel has many options that are worth exploring.
- Right now we are interested in the “Copy Datasets” option.
- We want to copy the data we have in this imported history to our previously created “RNA-Seq workshop” history.
17 Getting data into Galaxy (Method 4)
- For your “Source History”, select the imported one, and for your “Destination History”, select the RNA-Seq workshop.
- Select all the datasets that you want to copy to the “RNA-Seq workshop” history.
- Click on “Copy History Items”.
21 Aligning reads using TopHat We are not going to provide any genic structure information. TopHat will find splice junctions on its own.
22 Aligning reads using TopHat
- Always read the instructions before running software.
- In the left tools panel, search for tophat2.
- Click on tophat2; the central panel will then show you all the options for tophat2.
- Remember that the quality values in your fastq need to be Phred 33 (Sanger) scores.
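One rough way to check the quality encoding outside Galaxy is to look for quality characters that can only occur in Phred+33 files. The sketch below relies on the usual heuristic that ASCII characters from '!' (33) to ':' (58) never appear in Phred+64 or Solexa+64 quality strings. The function name looks_like_phred33 is hypothetical, and the check assumes four-line, uncompressed FASTQ records.

```shell
# looks_like_phred33 FILE
# Succeeds (exit status 0) if any quality line contains a character in the
# ASCII range '!'..':' (33..58), which is only valid in Phred+33 encoding.
# A failure does not prove the file is Phred+64: very high-quality Phred+33
# reads may contain no characters in that range.
looks_like_phred33() {
  awk 'NR % 4 == 0' "$1" | LC_ALL=C grep -q '[!-:]'
}
```

For instance, `looks_like_phred33 thrombin_expt.fastq && echo "Phred+33 (Sanger)"` prints the message only when such a character is found.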
23 Aligning reads using TopHat2
Run 1:
- No genic structure information (i.e. no GTF file)
- TopHat2 will find splice junctions on its own
- Run this on experimental and control data.
Run 2:
- Genic structure information will be used
- Run this on experimental data.
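For readers working at the command line instead of Galaxy, the two runs might be sketched as below. The index name chr22, the thread count, and the output directory names are assumptions chosen for illustration, and the `run` helper is a hypothetical wrapper that only echoes commands when DRY_RUN=1. Pairing `-G` with `--no-novel-juncs` is one way to restrict TopHat2 to junctions from the supplied GTF; the workshop's exact Galaxy settings may differ.

```shell
# run CMD [ARGS...]: echo the command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Build a Bowtie2 index called "chr22" from the reference FASTA.
build_index() {
  run bowtie2-build chr22.fa chr22
}

# Run 1: no gene annotation; TopHat2 discovers splice junctions on its own.
# $1 = FASTQ file, $2 = output directory (name chosen by the caller)
tophat_run1() {
  run tophat2 -p 4 -o "$2" chr22 "$1"
}

# Run 2: supply known gene models and only use junctions found in the GTF.
# $1 = FASTQ file, $2 = output directory (name chosen by the caller)
tophat_run2() {
  run tophat2 -p 4 -G genes-chr22.gtf --no-novel-juncs -o "$2" chr22 "$1"
}

# Example calls (directory names are illustrative):
#   build_index
#   tophat_run1 thrombin_control.fastq ctrl_thout
#   tophat_run1 thrombin_expt.fastq    expt_thout
#   tophat_run2 thrombin_expt.fastq    expt_genes_thout
```

Setting `DRY_RUN=1` before calling the functions prints each command without executing it, which is a convenient way to review parameters first.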
24 Alignment with Tophat2: Run 1
- In the left tools panel, search for tophat2.
- Click on tophat2; the central panel will then show you all the options for tophat2.
- Remember that the quality values in your fastq need to be Phred 33 (Sanger) scores.
25 Alignment with Tophat2: Run 1
26 Alignment with Tophat2: Run 1
Ask about split-segment
27 Alignment with Tophat2: Run 1
Click “Execute” once you have made all the selections.
28 Alignment with Tophat2: Run 1
Now we want to start a new tophat2 run for another fastq file in the RNA-Seq workshop history.
29 Alignment with Tophat2: Run 1
- Now we want to start a new tophat2 run for the control fastq file in the RNA-Seq workshop history.
- Since this is a “re-run”, all the parameters should be the same; this makes it easy to replicate runs and to go back and check run parameters.
- Always re-label new files immediately with names that make sense to you, by clicking on the pencil and changing attributes.
30 Rename Files
On Galaxy it's important to rename your files to something meaningful.
31 Evaluating alignment: Run 1 How many reads DID NOT align to the reference genome chr22?
33 Aligning reads using TopHat2
Run 1:
- No genic structure information (i.e. no GTF file)
- TopHat2 will find splice junctions on its own
- Run this on experimental and control data
Run 2:
- Genic structure information will be used
- Run this on experimental data only
34 Alignment with Tophat2: Run 2
Now we want to start a new informed tophat2 run.
35 Aligning reads using gene information
Click “Execute” once you have changed the selections shown above.
36 Rename Files
Rename your files and make sure they are distinct from the last dataset.
38 Comparison of alignments

sample #      fastq name             # reads   Unmapped reads (Run 1, Informed run (Run 2))
control       thrombin_control.txt   10,953    10127*
experimental  thrombin_expt.txt      12,027    14739

* We will not do an informed run on the control data in class. The results of such a run are given.

Conclusions
- There are fewer unmapped reads with the informed alignment, or Run 2 (i.e. when we use the known genes and known splice sites).
- TopHat’s prediction of splice junctions is not working very well for this dataset. (This is likely due to the low number of reads in our dataset.)
40 Tuxedo suite (Cufflinks)
The Cufflinks package has 4 components; the 2 major ones are listed below:
- Cufflinks does reference-based transcriptome assembly.
- Cuffdiff does statistical analysis and identifies differentially expressed transcripts in a simple pairwise comparison, or in a series of pairwise comparisons in a time-course experiment.
Trapnell et al., Nature Protocols, March 2012
41 Assembling transcripts using Cufflinks
- Run Cufflinks to obtain newly assembled gene transcripts from the aligned RNA-Seq reads.
- There is no need to conduct this step for the informed alignment (Run 2) because the locations of known genes are known already.
42 Cufflinks: Expt data
Click “Execute” once you have made all the selections.
43 Cufflinks: Control data
Now we want to start a new cufflinks run for the control dataset.
44 Cufflinks: Control data
- Now we want to start a new cufflinks run for the control dataset.
- Since this is a “re-run”, all the parameters should be the same; this makes it easy to replicate runs and to go back and check run parameters.
45 Merging transcript sets using Cuffmerge
- Run Cuffmerge in order to merge the assembled transcripts from control and experimental samples. The output of this will be your transcriptome.
- There is no need to conduct this step for the informed alignment.
46 Differential gene expression using Cuffdiff
- For Run 1 (uninformed), let's find out how many differentially expressed (DE) genes are present.
- We need a gene (.gtf) file and both alignment (.bam) files (control and experimental).
- We could use Cuffdiff on the informed alignments (Run 2) as well, but we normally recommend using htseq-count and edgeR instead.
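Outside Galaxy, the Run 1 assembly, merge, and differential-expression steps could be chained roughly as follows. This is a sketch under stated assumptions: the TopHat output directories (ctrl_thout, expt_thout), the other directory names, the sample labels, and the thread count are all hypothetical, and the `run` helper is an illustrative wrapper that only echoes commands when DRY_RUN=1.

```shell
# run CMD [ARGS...]: echo the command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Assemble transcripts per sample, merge them, then test for differential
# expression between control and experiment (uninformed Run 1 alignments).
assemble_and_compare() {
  # 1. Cufflinks: one assembly per sample; each writes transcripts.gtf.
  run cufflinks -p 4 -o ctrl_clout ctrl_thout/accepted_hits.bam
  run cufflinks -p 4 -o expt_clout expt_thout/accepted_hits.bam

  # 2. Cuffmerge: takes a text file listing the assemblies to merge.
  printf '%s\n' ctrl_clout/transcripts.gtf expt_clout/transcripts.gtf > assemblies.txt
  run cuffmerge -p 4 -s chr22.fa -o merged assemblies.txt

  # 3. Cuffdiff: merged transcriptome plus both alignment files.
  run cuffdiff -p 4 -o cuffdiff_out -L control,expt \
    merged/merged.gtf ctrl_thout/accepted_hits.bam expt_thout/accepted_hits.bam
}
```

Calling `DRY_RUN=1 assemble_and_compare` lists the commands in order without running them, which makes it easy to confirm that the GTF and both BAM files are passed to Cuffdiff as the slide describes.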
47 Differential gene expression using Cuffdiff
- Once you have set your specifications, hit Execute.
- This results in many output files; see the “Outputs” description below the Cuffdiff page for more details.
- We are interested in the differential expression of genes.
- Look at the last column and count the number of yes’s.
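Counting the yes’s by eye is error-prone, so the count can also be taken from a downloaded copy of the gene-level Cuffdiff output. The sketch below assumes a tab-separated file with a header row whose last column holds the yes/no significance call (as in Cuffdiff's gene_exp.diff); the function name count_significant is illustrative.

```shell
# count_significant FILE
# Print how many rows (excluding the header) have "yes" in the last column
# of a tab-separated Cuffdiff differential expression file.
count_significant() {
  awk -F'\t' 'NR > 1 && $NF == "yes" { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_significant gene_exp.diff` prints the number of genes called significant, and replacing `"yes"` with `"no"` in the awk condition gives the complement.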
48 Visualization Using IGV
The Integrative Genomics Viewer (IGV) is a tool that supports the visualization of reads mapped to a reference genome, among other functionalities.
49 Download data
Let's compare alignments and GTFs. Download 6 files to your computer:
- thrombin_expt_accepted_hits
- thrombin_expt_inform_accepted_hits
- Cuffmerge results
- genes-chr22.gtf
- Index files for both alignment files
50 Start IGV and load data
Load Genome
1. Within IGV, click the FILE tab on the menu bar.
2. Click the ‘Load Genome from Server’ option.
3. In the browser window, search for “human”, and select the hg19 version.
Load Other Files
2. Click the ‘Load from File’ option.
3. Select the files below (one at a time, or use the ctrl key to make multiple selections):
- ctrl_accepted_hits.bam
- ctrl_genes_accepted_hits.bam
- expt_accepted_hits.bam
- expt_genes_accepted_hits.bam
- first-cuffmerge_merged.gtf
- genes-chr22.gtf
51 Visualization with IGV Your browser window should look similar to the picture below:
52 Visualization with IGV
- Click here and type the following location of a differentially expressed gene: chr22:
- Move to the left and right of the gene. What do you see?
53 Visualization with IGV
- It looks like the new transcriptome (first-cuffmerge_merged.gtf) compares poorly to the known gene models. This is very likely due to the very low number of reads in our dataset.
- We can see that there are many more reads for one dataset compared to the other. Hence, it makes sense that the gene was called as being differentially expressed.
- Note the intron-spanning reads.
54 Conclusion
Today we used the Tuxedo Suite to:
- Align RNA-Seq reads using TopHat (a splice-aware aligner).
- Perform reference-based transcriptome assembly with Cufflinks.
- Obtain a new transcriptome using Cufflinks and Cuffmerge.
- Obtain a list of differentially expressed genes with Cuffdiff.
- Report a list of significantly differentially expressed genes.
We also used a genome browser and visualization tool to observe the aligned data and the new transcriptome.
55 Useful links
Online resources for RNA-Seq analysis questions:
- Biostar (Bioinformatics explained)
- SEQanswers (the next generation sequencing community)
Most tools have dedicated lists.
Information about the various parts of the Tuxedo suite is available here -
Genome browser tutorials:
- IGV tutorials
- UCSC browser tutorials
(OpenHelix is a great place for tutorials; UIUC has a campus-wide subscription.)
Contact us at: