Introduction to RNA-Seq and Transcriptome Analysis

Introduction to RNA-Seq and Transcriptome Analysis
Hands-on activities (Fun with UNIX!)
Bacterial Genome Assembly | Victor Jongeneel
PowerPoint: Jessica Kirkpatrick and Casey Hanson
RNA-Seq Lab | Jessica Kirkpatrick | 2015

Exercise
Use the Tuxedo Suite to:
Align RNA-Seq reads using TopHat (a splice-aware aligner).
Perform reference-based transcriptome assembly with Cufflinks.
Obtain a new transcriptome using Cufflinks & Cuffmerge.
Use Cuffdiff to obtain a list of differentially expressed genes.
Report a list of significantly differentially expressed genes.
Use a genome browser and visualization tool to observe the aligned data and the new transcriptome.

Tuxedo Suite
Bowtie and Bowtie2 use Burrows-Wheeler indexing for aligning reads. With Bowtie2 there is no upper limit on the read length.
TopHat uses either Bowtie or Bowtie2 to align reads in a splice-aware manner and aids the discovery of new splice junctions.
The Cufflinks package has 4 components; the 2 major ones are listed below:
- Cufflinks does reference-based transcriptome assembly.
- Cuffdiff does statistical analysis and identifies differentially expressed transcripts in a simple pairwise comparison, or in a series of pairwise comparisons in a time-course experiment.
Trapnell et al., Nature Protocols, March 2012
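To give a feel for what "Burrows-Wheeler indexing" means, here is a minimal toy sketch of the Burrows-Wheeler transform in Python. It is only an illustration of the idea: the function names are made up for this example, the rotation-sorting approach is far too slow for a genome, and it should not be read as how Bowtie or Bowtie2 actually build or search their indexes.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text (naive, for illustration)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Build every rotation of the string and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed, "->", inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes compact, searchable indexes of a reference genome possible.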

Premise
Question: Is there a difference in the results if the Tuxedo Suite is run 2 different ways?
1. Procedure:
Run 1: Allow TopHat to select splice junctions and proceed through the steps without giving the software any information about known genes/gene models.
Run 2: Force TopHat to use only known splice junctions (i.e. known genes/gene models) and proceed through the steps, making sure we are doing our analysis in the context of these gene models.
2. Evaluation:
a. 2 metrics: # of mapped reads and # of significantly different identified genes.
b. Compare the new transcriptome to known genes.


Input data
RNA-Seq: 100 bp, single-end data
sample      replicate #   fastq name              # reads
control     Replicate 1   thrombin_control.fastq  10,953
experiment                thrombin_expt.fastq     12,027
Genome & gene information:
name             description
chr22.fa         FASTA file with the sequence of chromosome 22 from the human genome (hg19, UCSC); this is the reference genome
genes-chr22.gtf  GTF file with gene annotation, known genes (hg19, UCSC)
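The read counts in the table can be checked directly from the FASTQ files, because each read occupies four lines (header, sequence, separator, qualities). The Python sketch below assumes well-formed, unwrapped four-line records; the function name is made up for this example, and the file name in the usage note is the one from the table.

```python
def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    with open(path) as handle:
        # Ignore completely empty lines (e.g. a trailing blank line).
        n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(
            f"{path}: {n_lines} non-empty lines is not a multiple of 4; "
            "the file may be truncated or use wrapped records"
        )
    return n_lines // 4
```

For example, `count_fastq_reads("thrombin_control.fastq")` should report the number listed for the control sample if the file matches the table.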

Sign in to Galaxy
Go to https://galaxy.illinois.edu
Click on the button.
Sign in using your classroom ID and password.

How Galaxy works with the biocluster
Biocluster signing up: http://biocluster.igb.illinois.edu/
Usage and cost: http://help.igb.illinois.edu/Biocluster
Christopher Fields

Rename the History

Accessing the input files
The data are located in the following directory: /home/classroom/rnaseq-mayo/
The rnaseq-mayo directory contains an input_data folder as well as a results folder. (Note: “~” is a symbol in UNIX paths that refers to your home directory.)
$ mkdir ~/rnaseq-mayo # Make a working directory in your home directory.
$ cp /home/classroom/rnaseq-mayo/input_data/* ~/rnaseq-mayo/ # Copy the data to your working directory.
$ qsub -I -q classroom -l nodes=1:ppn=4 # Log in to a “classroom” computer on the cluster in interactive mode, with 4 processors.

Getting data into Galaxy (Method 4) Click on the “Shared Data” pulldown menu Click on “Published Histories”

Getting data Click on the “Workshop FASTQs”

Getting data Click on the “Import History” on the top, towards the right

Getting data

Now your current history is the imported history, called “imported: RNA-Seq Chr 22 Data”. In the top right corner of the history panel is a wheel; click on that wheel.

Getting data The pulldown menu that is revealed when you click on the wheel has many options that are worth exploring… Right now we are interested in the “Copy Datasets” option Basically, we want to copy the data we have in this imported history to our previously created “RNA-Seq workshop” history

Getting data into Galaxy (Method 4) For your “Source History”, select the imported one and for your “Destination History”, select the RNA-Seq workshop Select all the datasets that you want to copy to the “RNA-Seq workshop” history Click on “Copy History Items”

Getting data

A glimpse at the input data
FASTA: chr22.fa
GTF: genes-chr22.gtf
FASTQ: thrombin_expt.fastq, thrombin_control.fastq

RUN 1: Alignment

Aligning reads using TopHat We are not going to provide any genic structure information. TopHat will find splice junctions on its own.

Aligning reads using TopHat
Always read the instructions before running software.
In the left tools panel, search for tophat2.
Click on tophat2; the central panel will then show you all the options for tophat2.
Remember that the quality values in your FASTQ file need to be Phred+33 (Sanger) scores.
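The Phred+33 requirement can be made concrete with a short Python sketch. In this encoding each quality character stores a score Q as the ASCII character with code Q + 33, and Q corresponds to an error probability of 10^(-Q/10). The function names below are made up for this example, and the encoding check is only a rough heuristic (characters with ASCII codes below 59 do not occur in the older +64 encodings), not a guarantee.

```python
def phred33_scores(quality):
    """Decode a FASTQ quality string, assuming Phred+33 (Sanger) encoding."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(score):
    """Convert a Phred score Q into the base-call error probability."""
    return 10 ** (-score / 10)


def looks_like_phred33(quality_strings):
    """Heuristic: any character below ASCII 59 rules out the +64 encodings."""
    return any(ord(ch) < 59 for q in quality_strings for ch in q)


if __name__ == "__main__":
    scores = phred33_scores("II5+!")
    print(scores, [round(error_probability(s), 4) for s in scores])
```

If `looks_like_phred33` returns False on a sample of reads, the encoding is ambiguous or may be an older +64 variant, and it is worth checking before running the aligner.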

Aligning reads using TopHat2
Run 1: No genic structure information (i.e. no GTF file). TopHat2 will find splice junctions on its own. Run this on experimental and control data.
Run 2: Genic structure information will be used. Run this on experimental data.

Alignment with Tophat2: Run 1
In the left tools panel, search for tophat2.
Click on tophat2; the central panel will then show you all the options for tophat2.
Remember that the quality values in your FASTQ file need to be Phred+33 (Sanger) scores.

Alignment with Tophat2: Run 1

Alignment with Tophat2: Run 1

Alignment with Tophat2: Run 1
Click “Execute” once you have made all the selections.

Alignment with Tophat2: Run 1
Now we want to start a new tophat2 run for another fastq file in the RNA-Seq workshop history.

Alignment with Tophat2: Run 1
Now we want to start a new tophat2 run for the control fastq file in the RNA-Seq workshop history.
Since this is a “re-run”, all the parameters should be the same; this makes it easy to replicate runs and to go back and check run parameters.
Always re-label new files immediately with names that make sense to you, by clicking on the pencil and changing the attributes.

Rename Files
In Galaxy, it's important to rename your files to something meaningful.

Evaluating alignment: Run 1 How many reads DID NOT align to the reference genome chr22?

Run 2: Informed Alignment

Aligning reads using TopHat2
Run 1: No genic structure information (i.e. no GTF file). TopHat2 will find splice junctions on its own. Run this on experimental and control data.
Run 2: Genic structure information will be used. Run this on experimental data only.

Alignment with Tophat2: Run 2
Now we want to start a new informed tophat2 run.

Aligning reads using gene information Click “Execute” once you have changed the selections shown above.

Rename Files Rename your files and make sure they are distinct from the last dataset

Evaluating alignment: Run 2

Comparison of alignments
sample        fastq name            # reads   unmapped reads (Run 1)   unmapped reads (informed run, Run 2)
control       thrombin_control.txt  10,953    101                      27*
experimental  thrombin_expt.txt     12,027    147                      39
* We will not do an informed run on the control data in class. The results of such a run are given.
Conclusions
There are fewer unmapped reads with the informed alignment, or Run 2 (i.e. when we use the known genes and known splice sites)!
TopHat’s prediction of splice junctions is not working very well for this dataset. (This is likely due to the low number of reads in our dataset.)
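The comparison is easier to judge as percentages. The Python sketch below uses only the numbers from the table above; the dictionary layout and function name are choices made for this example.

```python
# Read counts and unmapped-read counts copied from the comparison table.
COUNTS = {
    "control": {"reads": 10953, "run1": 101, "run2": 27},
    "experimental": {"reads": 12027, "run1": 147, "run2": 39},
}


def percent_unmapped(unmapped, total):
    """Return the percentage of reads that did not align."""
    return 100.0 * unmapped / total


if __name__ == "__main__":
    for sample, c in COUNTS.items():
        run1 = percent_unmapped(c["run1"], c["reads"])
        run2 = percent_unmapped(c["run2"], c["reads"])
        print(f"{sample}: Run 1 {run1:.2f}% unmapped, Run 2 {run2:.2f}% unmapped")
```

In both samples the informed run leaves a smaller share of reads unmapped, which matches the conclusion on the slide.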

Finding Differentially Expressed Genes

Tuxedo suite (Cufflinks)
The Cufflinks package has 4 components; the 2 major ones are listed below:
- Cufflinks does reference-based transcriptome assembly.
- Cuffdiff does statistical analysis and identifies differentially expressed transcripts in a simple pairwise comparison, or in a series of pairwise comparisons in a time-course experiment.
Trapnell et al., Nature Protocols, March 2012

Assembling transcripts using Cufflinks Run Cufflinks to obtain newly assembled gene transcripts from the aligned RNA-Seq reads. There is no need to conduct this step for the informed alignment (Run 2) because the locations of known genes are known already.

Cufflinks: Expt data Click “Execute” once you have made all the selections.

Cufflinks: Control data
Now we want to start a new cufflinks run for the control dataset.

Cufflinks: Control data
Now we want to start a new cufflinks run for the control dataset.
Since this is a “re-run”, all the parameters should be the same; this makes it easy to replicate runs and to go back and check run parameters.

Merging transcript sets using Cuffmerge
Run Cuffmerge in order to merge the assembled transcripts from the control and experimental samples. The output of this will be your transcriptome.
There is no need to conduct this step for the informed alignment.
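One quick way to get a feel for the merged transcriptome is to count the distinct transcripts in a GTF file and compare that with the known gene annotation. The Python sketch below assumes standard nine-column, tab-separated GTF lines with a `transcript_id "..."` attribute; the function name and the sample lines in the test are made up for this example.

```python
import re

# Matches GTF attribute pairs such as: transcript_id "TCONS_00000001";
_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcript_ids(lines):
    """Collect the distinct transcript_id values found in GTF lines."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attributes = dict(_ATTRIBUTE.findall(fields[8]))
        if "transcript_id" in attributes:
            ids.add(attributes["transcript_id"])
    return ids
```

Calling `len(transcript_ids(open(path)))` on the Cuffmerge output and on genes-chr22.gtf gives two numbers to compare; with as few reads as in this lab, the assembled set is likely to be much smaller.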

Differential gene expression using Cuffdiff
For Run 1 (uninformed), let's find out how many differentially expressed (DE) genes are present.
We need a gene annotation (.gtf) file and both alignment (.bam) files (control and experimental).
We could use Cuffdiff on the informed alignments (Run 2) as well, but we normally recommend using htseq-count and edgeR instead.

Differential gene expression using Cuffdiff
Once you have set your specifications, hit Execute.
This results in many output files. See the “Outputs” description below the Cuffdiff page for more details.
We are interested in the differential expression of genes.
Look at the last column and count the number of yes’s.
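Counting the yes’s by eye works for a small table; for a longer one, a few lines of Python can do it. The sketch below assumes the gene-level differential expression output has been downloaded as a tab-separated file whose last column is named `significant`, as in Cuffdiff's gene_exp.diff. The function name and the two example rows are made up for illustration and are not results from this lab.

```python
import csv
import io


def significant_genes(handle):
    """Return (gene, locus) pairs whose 'significant' column says yes."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        if row["significant"].strip().lower() == "yes":
            hits.append((row["gene"], row["locus"]))
    return hits


# Made-up rows for illustration only; they are not results from this lab.
EXAMPLE = (
    "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\t"
    "log2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n"
    "XLOC_000001\tXLOC_000001\tGENE_A\tchr22:100-200\tcontrol\texpt\tOK\t"
    "1.0\t8.0\t3.0\t4.2\t0.0001\t0.001\tyes\n"
    "XLOC_000002\tXLOC_000002\tGENE_B\tchr22:300-400\tcontrol\texpt\tOK\t"
    "5.0\t5.5\t0.1375\t0.3\t0.7\t0.8\tno\n"
)

if __name__ == "__main__":
    hits = significant_genes(io.StringIO(EXAMPLE))
    print(len(hits), "significant gene(s):", hits)
```

On a real download, replace the `io.StringIO(EXAMPLE)` argument with an open file handle, for example `open("gene_exp.diff")` if that is what the file is called on your computer.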

Visualization Using IGV
The Integrative Genomics Viewer (IGV) is a tool that supports the visualization of mapped reads to a reference genome, among other functionalities.

Download data
Let's compare alignments and GTFs.
Download 6 files to your computer:
thrombin_expt_accepted_hits
thrombin_expt_inform_accepted_hits
Cuffmerge results
genes-chr22.gtf
Index files for both alignment files

Start IGV and load data
Load Genome
1. Within IGV, click the FILE tab on the menu bar.
2. Click the ‘Load Genome from Server’ option.
3. In the browser window, search for “human”, and select the hg19 version.
Load Other Files
1. Within IGV, click the FILE tab on the menu bar.
2. Click the ‘Load from File’ option.
3. Select the files below (one at a time, or use the ctrl key to make multiple selections):
ctrl_accepted_hits.bam
ctrl_genes_accepted_hits.bam
expt_accepted_hits.bam
expt_genes_accepted_hits.bam
first-cuffmerge_merged.gtf
genes-chr22.gtf

Visualization with IGV Your browser window should look similar to the picture below:

Visualization with IGV
Click in the locus box and type the following location of a differentially expressed gene: chr22:19960675-19963235
Move to the left and right of the gene. What do you see?
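The locus string follows the pattern chromosome:start-end. As a small illustration, the Python sketch below splits such a string into its parts and reports the size of the region; the function names are made up for this example, and coordinates are treated as 1-based and inclusive, as they are in GTF files.

```python
import re


def parse_locus(locus):
    """Split 'chr:start-end' into (chromosome, start, end) with integer coordinates."""
    match = re.fullmatch(r"([^:]+):([\d,]+)-([\d,]+)", locus.strip())
    if match is None:
        raise ValueError(f"not a chr:start-end locus: {locus!r}")
    chrom, start, end = match.groups()
    # Thousands separators (e.g. 19,960,675) are removed before conversion.
    start, end = int(start.replace(",", "")), int(end.replace(",", ""))
    if start > end:
        raise ValueError("start must not be greater than end")
    return chrom, start, end


def span_length(locus):
    """Number of bases covered by the locus, counting both end positions."""
    _, start, end = parse_locus(locus)
    return end - start + 1


if __name__ == "__main__":
    print(parse_locus("chr22:19960675-19963235"), span_length("chr22:19960675-19963235"))
```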

Visualization with IGV
It looks like the new transcriptome (first-cuffmerge_merged.gtf) compares poorly to the known gene models. This is very likely due to the very low number of reads in our dataset.
We can see that there are many more reads for one dataset compared to the other. Hence, it makes sense that the gene was called as being differentially expressed.
Note the intron-spanning reads.

Conclusion
Today we used the Tuxedo Suite to:
Align RNA-Seq reads using TopHat (a splice-aware aligner).
Perform reference-based transcriptome assembly with Cufflinks.
Obtain a new transcriptome using Cufflinks & Cuffmerge.
Obtain a list of differentially expressed genes with Cuffdiff.
Report a list of significantly differentially expressed genes.
We also used a genome browser and visualization tool to observe the aligned data and the new transcriptome.

Useful links
Online resources for RNA-Seq analysis questions:
http://www.biostars.org/ - Biostar (Bioinformatics explained)
http://seqanswers.com/ - SEQanswers (the next generation sequencing community)
Most tools have a dedicated mailing list.
Information about the various parts of the Tuxedo suite is available here: http://ccb.jhu.edu/software.shtml
Genome browser tutorials:
http://www.broadinstitute.org/igv/QuickStart/ - IGV tutorials
http://www.openhelix.com/ucsc/ - UCSC browser tutorials (openhelix is a great place for tutorials; UIUC has a campus-wide subscription)
Contact us at: hpcbiohelp@illinois.edu hpcbiotraining@illinois.edu