Al Ritacco, Shailender Nagpal Research Computing

Slides:



Advertisements
Similar presentations
IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
Advertisements

RNAseq.
Peter Tsai Bioinformatics Institute, University of Auckland
TOPHAT Next-Generation Sequencing Workshop RNA-Seq Mapping
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
RNA-seq Analysis in Galaxy
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
Bacterial Genome Assembly | Victor Jongeneel Radhika S. Khetani
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
Whole Exome Sequencing for Variant Discovery and Prioritisation
Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts.
An Introduction to RNA-Seq Transcriptome Profiling with iPlant
Introduction to RNA-Seq and Transcriptome Analysis
Li and Dewey BMC Bioinformatics 2011, 12:323
Viewing & Getting GO COST Functional Modeling Workshop April, Helsinki.
Expression Analysis of RNA-seq Data
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
RNAseq analyses -- methods
Introduction to RNA-Seq & Transcriptome Analysis
High Throughput Sequence (HTS) data analysis 1.Storage and retrieving of HTS data. 2.Representation of HTS data. 3.Visualization of HTS data. 4.Discovering.
NGS data analysis CCM Seminar series Michael Liang:
Next Generation DNA Sequencing
Schedule change Day 2: AM - Introduction to RNA-Seq (and a touch of miRNA-Seq) Day 2: PM - RNA-Seq practical (Tophat + Cuffdiff pipeline on Galaxy) Day.
TopHat Mi-kyoung Seo. Today’s paper..TopHat Cole Trapnell at the University of Washington's Department of Genome Sciences Steven Salzberg Center.
Transcriptome Analysis
RNA-Seq in Galaxy Igor Makunin QAAFI, Internal Workshop, April 17, 2015.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Introduction to RNA-Seq
IPlant Collaborative Discovery Environment RNA-seq Basic Analysis Log in with your iPlant ID; three orange icons.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
RNA-Seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis is doing the.
Introduction to RNAseq
Bioinformatics for biologists Dr. Habil Zare, PhD PI of Oncinfo Lab Assistant Professor, Department of Computer Science Texas State University Presented.
The iPlant Collaborative
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
The iPlant Collaborative
-1- Module 3: RNA-Seq Module 3 BAMView Introduction Recently, the use of new sequencing technologies (pyrosequencing, Illumina-Solexa) have produced large.
User-friendly Galaxy interface and analysis workflows for deep sequencing data Oskari Timonen and Petri Pölönen.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
High Throughput Sequence (HTS) data analysis 1.Storage and retrieving of HTS data. 2.Representation of HTS data. 3.Visualization of HTS data. 4.Discovering.
Canadian Bioinformatics Workshops
RNA Seq Analysis Aaron Odell June 17 th Mapping Strategy A few questions you’ll want to ask about your data… - What organism is the data from? -
Introductory RNA-seq Transcriptome Profiling of the hy5 mutation in Arabidopsis thaliana.
Practice:submit the ChIP_Streamline.pbs 1.Replace with your 2.Make sure the.fastq files are in your GMS6014 directory.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
Introductory RNA-seq Transcriptome Profiling
NGS File formats Raw data from various vendors => various formats
WS9: RNA-Seq Analysis with Galaxy (non-model organism )
Dr. Christoph W. Sensen und Dr. Jung Soh Trieste Course 2017
VCF format: variants c.f. S. Brown NYU
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
Introductory RNA-Seq Transcriptome Profiling
Learning to count: quantifying signal
Maximize read usage through mapping strategies
Additional file 2: RNA-Seq data analysis pipeline
Canadian Bioinformatics Workshops
Sequence Analysis - RNA-Seq 2
BF528 - Sequence Analysis Fundamentals
RNA-Seq Data Analysis UND Genomics Core.
Presentation transcript:

Al Ritacco, Shailender Nagpal Research Computing RNA-seq for Transcriptome profiling and discovery of novel transcripts and alternatively spliced variants using HPC Presented by: Al Ritacco, Shailender Nagpal Research Computing UMASS Medical School Information Services, 09/17/2012

Agenda SESSION 1 SESSION 2 What is RNA-seq? Workflow for RNA-seq analysis Tools required SESSION 2 Download and perform QC on sample dataset Mapping and alignment Transcript expression and other “discoveries” Information Services, 00/00/2010

Agenda SESSION 1 SESSION 2 What is RNA-seq? Workflow for RNA-seq analysis Tools required SESSION 2 Download and perform QC on sample dataset Mapping and alignment Transcript expression and other “discoveries” Information Services, 00/00/2010

What is “Next Generation Sequencing”? Set of new high throughput technologies allow millions of short DNA sequences from a biological sample to be “read” or sequenced in a rapid manner Computational power is then used to assemble or align the “reads” to a reference genome, allowing biologists to make comparisons and interpret various biological phenomena Due to high depth of coverage (30-100x), accurate sequencing is obtained much faster and cheaper compared to traditional Sanger/Shotgun sequencing

RNA-Seq experiment Reverse transcription of mRNAs yield double stranded cDNAs, which are sliced to selected fragment length

What is RNA-Seq? RNA-seq is a Next Generation Sequencing (NGS) technology for sequencing total mRNA (“expressed”) in biological samples of interest such as tissues, tumors and cell lines Provides deep coverage and base level resolution Abundance of known transcripts Novel transcripts Alternative splicing events Post-transcriptional mutations Gene fusions

RNA-seq reads: Issue 1 ACTTAAGGCTGACTAGC TCGTACCGATATGCTG Small, single-end reads are hard to align to a reference genome - multiple possible mapping sites. Longer reads can overcome this limitation Paired ends allow for longer fragments to be sequenced, with a small read from each end of the fragment. Distance between ends validates mapping

RNA-seq reads: Issue 2 Reads from one exon are easily mapped Reads that span 2 exons are handled by special software like TopHat Information Services, 00/00/2010

DNA assembly versus alignment DNA assembly is the computational task of putting together pieces of the genome in the original order Overlapping short sequences are extended to form islands, that are subsequently extended and merged Used for denovo sequencing of a new genome DNA alignment is the computational task of mapping short reads to a known, sequenced genome Reads and searched for in the full genome sequence, then aligned with a local alignment algorithm

Agenda SESSION 1 SESSION 2 What is RNA-seq? Workflow for RNA-seq analysis Tools required SESSION 2 Download and perform QC on sample dataset Mapping and alignment Transcript expression and other “discoveries” Information Services, 00/00/2010

RNA-seq workflow summary Sample preparation and submission for experiment. Obtain “read” file(s) as FASTQ Perform QC Perform read alignment and mapping to reference genome Determine expression, novel transcripts and alternative splicing

Step 1: Working with FASTQ Reads After a sample has been processed by an NGS platform, DNA sequence “reads” are provided to the user in FASTQ format FASTQ is a text-based format for storing both a DNA sequence and its corresponding quality scores  sequence letter and quality score are encoded with a single ASCII character for brevity originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data recently become the de facto standard for storing the output of high throughput sequencing instruments

Type of reads Single end reads Paired end reads refer to the sequence determined as DNA bases are added to single stranded DNA and detected, usually from one end only Paired end reads refer to the two ends of the same DNA molecule After sequencing one end, you can turn it around and sequence the other end. Long segment of DNA in between the two ends (usually 200-500 bp), who’s sequence is unknown Once the two paired end reads are mapped, the intermediate sequence can be inferred from reference sequence

Quality Score A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect) Two different equations have been in use. The first is the standard Sanger variant to assess reliability of a base call, otherwise known as Phred quality score: The Solexa pipeline earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p: Although both mappings are asymptotically identical at higher quality values, they differ at lower quality levels (i.e., approximately p > 0.05, or equivalently, Q < 13)

Quality Score Encoding Various platform produce versions of FASTQ format, which mainly differ in the Quality score representation Converters from various tools can convert Solexa to Sanger FASTQ format

Step 2: Quality Control FASTQ, PRINSEQ or other custom tools can be used to perform QC on FASTQ files Good tutorial for FASTQC on Youtube: https://www.youtube.com/watch?v=bz93ReOv87Y Information Services, 00/00/2010

Step 3: Alignment/Mapping to Reference Genome Many tools exist that will map the reads to the reference genome and align them, generating a quality score per base of alignment BLAST-like algorithm to search the read in the genome Local alignment to determine the accuracy of the alignment Many tools (SAMtools, MAQ, TopHat, CASSAVA, etc) use the BWA, Bowtie and Eland algorithms.Result of alignment is the SAM file

Alignment algorithms A category of aligners that hash the reads and scan the genome for matches Eland, RMAP, MAQ, ZOOM, SeqMap, CloudBurst, SHRiMP Disadvantages For few reads, whole genome must be scanned Memory footprint is variable

Alignment algorithms (…contd) Another category of aligners hash the reference genome SOAPv1, PASS, MOM, ProbeMatch, NovoAlign, ReSeq, Mosaik, Bfast Easily parallelized with multi-threading, but they usually require large memory to build an index for the human genome Disadvantage: iterative strategy frequently introduced by these software may make their speed sensitive to the sequencing error rate

Alignment algorithms (…contd) A third category which does alignment by merge-sorting the reference subsequences and read sequences Slider Burrows–Wheeler Transform (BWT) for string matching has been incorporated into a new generation of alignment tools SOAPv2, Bowtie and BWA These are memory efficient and suit well to single and paired-end read alignments

Step 4: Transcript expression “Cufflinks”, “myRNA” and other software can estimate the expression level/ abundance of the transcripts Other discoveries possible – novel transcripts, fusions, etc.

Agenda SESSION 1 SESSION 2 What is RNA-seq? Workflow for RNA-seq analysis Tools required SESSION 2 Download and perform QC on sample dataset Mapping and alignment Transcript expression and other “discoveries” Information Services, 00/00/2010

FASTQ manipulation tools Galaxy FASTQ tools https://bitbucket.org/galaxy/galaxy-central/src/tip/tools/fastq FASTXtoolkit http://hannonlab.cshl.edu/fastx_toolkit

QC software FASTqc http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ PRINSEQ http://prinseq.sourceforge.net Information Services, 00/00/2010

Alignment software Single and paired-end alignments can be done using the BWA algorithm in the following software MAQ (http://maq.sourceforget.net) BWA Bowtie TopHat (http://tophat.cbcb.umd.edu) Alignments are produced generally in the widely accepted SAM format

Step 4: Transcript expression Cufflinks myRNA

Agenda SESSION 1 SESSION 2 What is RNA-seq? Workflow for RNA-seq analysis Tools required SESSION 2 Download and perform QC on sample dataset Mapping and alignment Transcript expression and other “discoveries” Information Services, 00/00/2010

Computing Hardware Requirements Two types of computing hardware are ideally suited for NGS data analysis High-end workstation, for example: 64-bit linux, 3.6 GHz quad-core processor, 32 GB RAM, 7200 rpm hard disk High Performance Computing cluster (HPC) where total execution time can be sped up – for example, split reads into small files and align them in parallel on dozens of nodes For this workshop, we will use HPC

RNA-seq datasets NCBI’s Short Read Archive is a good source of datasets. Data can be download from: http://www.ncbi.nlm.nih.gov/sra Look for HapMap, 1000 genomes, cancer samples – all types of experiment types and platforms Combine keywords in search: “Illumina” and “Paired” and “RNA-seq” and “HapMap”

Obtain data from SRA Dataset to be used: Prostate cancer http://www.ncbi.nlm.nih.gov/sra/SRX022065 Tumor-matched Normal http://www.ncbi.nlm.nih.gov/sra/SRX022083 These are two paired-end Solexa datasets with reads split by those belong to the forward or end of a sequence fragment

Setting up the data for analysis Create directory for this dataset cd /home/username mkdir rna-seq cd rna-seq Information Services, 00/00/2010

Load RNA-seq tools Load TopHat, Bowtie, Cufflinks and Samtools module load tophat-1.2.0 module load cufflinks-2.0.2 module load bowtie-0.12.7 module load samtools-0.1.18 module load sratoolkit.2.1.9

Download Genome Annotation (GTF) file The genome annotation file contains exon/intron co-ordinates of reference genome

Download the dataset Download the Paired-end reads for the following library wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR057/SRR057652/SRR057652.sra wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR057/SRR057634/SRR057634.sra

SRA tools to extract FASTQ files Use "sratools" to convert SRA format to FASTQ fastq-dump -A SRR057634 --split-3 SRR057634.sra fastq-dump -A SRR057652 --split-3 SRR057652.sra Information Services, 00/00/2010

Perform QC using FASTQC Load FASTQC module module load fastqc-1.0 Perform QC fastqc –t 8 SRR057634_1.fastq fastqc –t 8 SRR057634_2.fastq This creates 2 directories with output in HTML format for visual inspection in a browser, key statistics and tests are in text file The zip files generated can be deleted

Agenda SESSION 1 SESSION 2 What is RNA-seq? Workflow for RNA-seq analysis Tools required SESSION 2 Download and perform QC on sample dataset Mapping and alignment Transcript expression and other “discoveries” Information Services, 00/00/2010

TopHat-Cufflinks Information Services, 00/00/2010

Alignment and Mapping For RNA-seq, the reads will either map fully to exons or partially to exon-intron junctions, resulting in rejected reads for the latter case Should not use Bowtie or BWA directly for mapping against a reference genome one of the goals is to identify novel transcripts, so we should not use transcriptome as reference TopHat is ideally suited for the job, uses Bowtie for read alignment needs the reference genome and it's annotations as an input. Annotations are optional

Running TopHat for read alignment To run TopHat, execute the following command: tophat --num-threads 8 \ --solexa-quals --max-multihits 10 \ --coverage-search --microexon-search \ --mate-inner-dist 150 \ -o tophat-tumor-out \ --keep-tmp \ –G /usr/public_data/ucsc/genomes/hg19/hg19.gtf \ hg19 \ SRR057634_1.fastq \ SRR057634_2.fastq

Working with TopHat output This produces the “tophat-tumor-out” folder with the following files: accepted_hits.bam junctions.bed insertions.bed deletions.bed

Agenda SESSION 1 SESSION 2 What is RNA-seq? Workflow for RNA-seq analysis Tools required SESSION 2 Download and perform QC on sample dataset Mapping and alignment Transcript expression and other “discoveries” Information Services, 00/00/2010

Reporting quantitative expression: FPKM/RPKM In NGS RNA-seq experiments, quantitative gene expression data is normalized for total gene/transcript length and the number of sequencing reads, and reported as RPKM: Reads Per Kilobase of exon per Million mapped reads. Used for reporting data based on single-end reads FPKM: Fragments Per Kilobase of exon per Million fragments. Used for reporting data based on paired-end fragments

Cufflinks to estimate expression Quantify reference genes and transcripts only cufflinks -p 8 -G hg19.gtf -o cuff-tumor accepted_hits.bam Quantify novel genes & transcripts use hg19 as "guide” cufflinks -p 8 -g hg19.gtf -o cuff-tumor accepted_hits.bam Quantify novel genes & transcripts, "unguided" cufflinks -p 8 -o cuff-tumor accepted_hits.bam

Cufflinks (…contd) This creates the following files transcripts.gtf (Generated annotation) isoforms.fpkm_tracking (Transcript expression) genes.fpkm_tracking (Gene expression) skipped.gtf (Skipped annotations) Depending on the mode in previous step, these files can have vague identifiers (CUFF*) for gene names if known gene annotations are not used. We have to compare with reference annotations to uncover which genes they are "Cuffcompare" allows us to do that

Compare assembled transcripts to reference # Run “cuffcompare” for reference-guided assembly cd cuff2 cuffcompare -r ../../hg19.gtf –R -V transcripts.gtf # This produces the following files in the “cuffcom-out” directory with the following files cuffcmp.transcripts.gtf.tmap, cuffcmp.transcripts.gtf.refmap, cuffcmp.tracking, cuffcmp.stats, cuffcmp.loci, cuffcmp.combined.gtf

Cuffcompare output cuffcmp.stats cuffcmp.combined.gtf Reports statistics related to the "accuracy" of the transcripts when compared to the reference annotation data. Gene finding measures of “sensitivity” and “specificity” are calculated at various levels (nucleotide, exon, intron, transcript, gene) cuffcmp.combined.gtf Reports a GTF file containing the "union" of all transfrags in each sample

Cuffcompare output cuffcmp.loci cuffcmp.tracking Each row contains a transcript structure that is present in one or more input GTF files Column number Column name Example Description 1 Cufflinks transfrag id TCONS_00000045 A unique internal id for the transfrag 2 Cufflinks locus id XLOC_000023 A unique internal id for the locus 3 Reference gene id Tcea The gene_name attribute of the reference GTF record for this transcript, or '-' if no reference transcript overlaps this Cufflinks transcript 4 Reference transcript id uc007afj.1 The transcript_id attribute of the reference GTF record for this transcript, or '-' if no reference transcript overlaps this Cufflinks transcript 5 Class code c The type of match between the Cufflinks transcripts in column 6 and the reference transcript. See class codes

Cuffcompare output cuffcmp.transcripts.gtf.refmap For each input GTF file, it lists the reference transcripts, one row per reference transcript for cufflinks transcripts that either fully or partially match it Column number Column name Example Description 1 Reference gene name uc007crl.1 The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used. 2 Reference transcript id The transcript_id attribute of the reference GTF record for this transcript 3 Class code c The type of match between the Cufflinks transcripts in column 4 and the reference transcript. One of either 'c' for partial match, or '=' for full match. 4 Cufflinks matches CUFF.23567.0,CUFF.24689.0 A comma separated list of Cufflinks transcript ids matching the reference transcript

Cufflinks output (…contd) cuffcmp.transcripts.gtf.tmap For each input GTF file, it lists the most closely matching reference transcript, one row per cufflinks transcript, for each cufflinks transcript (see column definitions on next slide)

Cufflinks output (…contd) Col. Column name Example Description 1 Reference gene name Myog The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used. 2 Reference transcript id uc007crl.1 The transcript_id attribute of the reference GTF record for this transcript 3 Class code c The type of relationship between the Cufflinks transcripts in column 4 and the reference transcript (see Class Codes) 4 Cufflinks gene id CUFF.23567 The Cufflinks internal gene id 5 Cufflinks transcript id CUFF.23567.0 The Cufflinks internal transcript id 6 Fraction of major isoform (FMI) 100 The expression of this transcript expressed as a fraction of the major isoform for the gene. Ranges from 1 to 100. 7 FPKM 1.4567 The expression of this transcript expressed in FPKM 8 FPKM_conf_lo 0.7778 The lower limit of the 95% FPKM confidence interval 9 FPKM_conf_hi 1.9776 The upper limit of the 95% FPKM confidence interval 10 Coverage 3.2687 The estimated average depth of read coverage across the transcript. 11 Length 1426 The length of the transcript 12 Major isoform ID The Cufflinks ID of the gene's major isoform

Cuffcompare output (…contd) Class codes Priority Code Description 1 = Complete match of intron chain 2 c Contained 3 j Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript 4 e Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. 5 i A transfrag falling entirely within a reference intron 6 o Generic exonic overlap with a reference transcript 7 p Possible polymerase run-on fragment (within 2Kbases of a reference transcript) 8 r Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case 9 u Unknown, intergenic transcript 10 x Exonic overlap with reference on the opposite strand 11 s An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) 12 . (.tracking file only, indicates multiple classifications)

Further analysis The slides so far show how to interpret the RNA-seq results of one sample/library In an actual experiment, tumor-normal pair might be available as separate library. Refer to tophat-cufflinks diagram to choose workflow Might involve cuffmerge prior to cuffcompare Cuffdiff and cuffquant can be used to report expression or even perform differential expression analysis Information Services, 00/00/2010

Q&A Don’t be shy – everyone has a different work flow and we want to help you Information Services, 00/00/2010