RNA-Seq and Transcriptome Analysis

Jessica R. Holmes, Research & Instructional Specialist in Life Sciences, High Performance Biological Computing (HPCBio), Roy J. Carver Biotechnology Center

General Outline
1. Getting the RNA-Seq data: from RNA -> sequence data
2. Experimental and practical considerations
3. Commonly encountered file formats
4. Transcriptomic analysis methods and tools
   - Transcriptome assembly
   - Differential gene expression

Transcriptome Sequencing (aka RNA-Seq)
Why sequence RNA?
- Differential gene expression (DGE)
  - Quantitative evaluation and comparison of transcript levels, usually between different groups
  - The vast majority of RNA-Seq is for DGE
- Transcriptome assembly
  - Build a new or improved profile of transcribed regions (“gene models”) of the genome
  - Can then be used for DGE
- Metatranscriptomics
  - Transcriptome analysis of a community of different species (e.g., gut bacteria, hot springs, soil)
  - Gain insights into the functioning and activity of the community rather than just who is present

Types of RNA
- Ribosomal (rRNA): responsible for protein synthesis; about 60% of a ribosome; up to 95% of total RNA in a cell
- Messenger (mRNA): translated into protein in the ribosome; 3-4% of total RNA in a cell; has poly-A tails in eukaryotes
- Micro (miRNA): short (about 22 nt) non-coding RNA involved in expression regulation
- Transfer (tRNA): brings specific amino acids for protein synthesis
- Others (lncRNA, shRNA, siRNA, snoRNA, etc.)

RNA-Seq: Experimental and Practical considerations
You usually don’t want to sequence rRNA, so you need to select for other RNA types by:
- poly-A selection (eukaryotes only)
- ribosomal depletion
- size selection

Illumina Sequencing Workflow
Library preparation (step 1): fragment DNA, repair ends, add A overhang, ligate adapters, purify.

From RNA -> sequence data
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

How do we sequence DNA?
- 1st generation: Sanger method (1977; automated sequencers followed about a decade later)
- 2nd generation (“next generation”; 2005):
  - 454: pyrosequencing
  - SOLiD: sequencing by ligation
  - Illumina: sequencing by synthesis
  - Ion Torrent: ion semiconductor
  - Pac Bio: Single Molecule Real-Time (SMRT) sequencing, 1000 bp
- 3rd generation (2015):
  - Pac Bio: SMRT, Sequel system, 20,000 bp
  - Nanopore: ion current detection
  - 10X Genomics: novel library prep for Illumina

Illumina – “short read” sequencing
- Rapid improvements over the years, from 36 bp to 300 bp reads; highest throughput at 100 bp; many different types of sequencers for various applications
- Can also “flip” a longer DNA strand and sequence from the other end to get paired-end reads
- Accuracy: 99.99%
- Biases: yes
- Most common platform for transcriptome sequencing
[Figure: single-read versus paired-end sequencing, 100 nt reads]

Illumina Sequencing Workflow
1. Library preparation: fragment DNA, repair ends, add A overhang, ligate adapters, purify
2. Cluster generation: hybridize to flow cell, extend hybridized template, perform bridge amplification, prepare flow cell for sequencing
3. Sequencing: perform sequencing, generate base calls
4. Data analysis: images, intensities, reads, alignments

Illumina Sequencing Technology Workflow
[Figure: library preparation from DNA (μg quantities), single molecule array, cluster growth, sequencing, image acquisition, and base calling]

Illumina Sequencing Video

RNA-Seq: Experimental and Practical considerations
Experimental design
- Technical replicates
  - Illumina has low technical variation, unlike microarrays
  - Technical replicates are unnecessary
- Batch effects
  - Best to sequence everything for an experiment at the same time
  - If you are preparing the libraries, be consistent and make them simultaneously
- Biological replicates
  - These are essential for your experiment to have any statistical power
  - At least 3, but the more the better

RNA-Seq: Experimental and Practical considerations
Experimental design
- For transcriptome assembly
  - RNA can be pooled from various sources to ensure the most robust transcriptome
  - Pooling can also be done after sequencing, but before assembly
- For differential gene expression
  - Pooling RNA from multiple biological replicates is usually not advisable

RNA-Seq: Experimental and Practical considerations
Poly(A) enrichment or ribosomal RNA depletion?
- Unless you are interested in rRNA, it is always best to remove it
  - Remember that it makes up ~95% of total RNA
- Use poly(A) enrichment if you are only interested in coding RNA
  - This is very common, but do not use it if you are also interested in non-coding RNA
- For metatranscriptomics, you should also remove host-derived sequences, but this can be done with bioinformatics tools downstream

RNA-Seq: Experimental and Practical considerations
Single-end (SE) or paired-end (PE)?
- SE is most common for DGE analysis
- PE is best for assemblies, for isoform differentiation, and for paralogous & orthologous gene differentiation (i.e. high-ploidy genomes & metatranscriptomes)
Single-end read
  Read1 ATGTTCCATAAGC…
Paired-end reads
  Read1 ATGTTCCATAAGC…
  Read2 CCGTAATGGCATG…

RNA-Seq: Experimental and Practical considerations
Stranded?
- Most RNA-Seq library preparation kits produce stranded libraries
  - Exception: single-cell prep kits may be unstranded
- A stranded library lets you identify which strand of DNA the RNA was transcribed from
- Strandedness is advisable for all applications
- Three types of libraries
  - Unstranded: the strand of DNA that the reads were transcribed from is unknown
  - Reverse: reads were transcribed from the strand whose sequence is complementary to the reads
  - Forward: reads were transcribed from the strand whose sequence is identical to the reads

From RNA -> sequence data
[Figure: library preparation steps, including Uracil DNA Glycosylase treatment] Borodina T., Methods in Enzymology (2011) 500:79–98

File formats: a brief note
- Sequence formats: FASTA, FASTQ
- Feature formats: GFF, GTF
- Alignment formats: SAM, BAM

Formats: FASTA
  >unique_sequence_ID My sequence is pretty cool
  ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
- Deceptively simple format (i.e. there is no standard)
- However, in general:
  - The header line starts with ‘>’, followed directly by an ID
  - … and an optional description (separated from the ID by a space)
- Files can be fairly large (whole genomes)
- Any residue type (DNA, RNA, protein), but a simple alphabet

Formats: FASTA
E.g. a read:
  >unique_sequence_ID
  ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAATTTATGATAAAA
E.g. a chromosome:
  >Group10 gi| |ref|NC_ | Amel_4.5, whole genome shotgun sequence
  TAATTTATATATCTATTTTTTTTATTAAAAAATTTATATTTTTGTTAAAATTTTATTTGATTAGAAATAT
  TTTTACTATTGTTCATTAATCGTTAATTAAAGATAGCACAGCACATGTAAGAATTCTAGGTCATGCGAAA
  TTAAAAATTAAAAATATTCATATTTCTATAATAATTAAATTATTGTTTTAATTTAAGTAAAAAAATTTCT
  AAGAAATCAAAAATTTGTTGTAATATTGAAACAAAATTTTGTTGTCTGCTTTTTATAGTAACTAATAAAT
  ATTTAATAAAAAATTACTTTATTTAATATTTTATAATAAATCAAATTGTCCAATTTGAAATTTATTTTAT
  CACTAAAAATATCTTTATTATAGTCAATATTTTTTGTTAGGTTTAAATAATTGTTAAAATTAGAAAATGA
  TCGATATTTTCAAATAGTACGTTTAACTAATACTTAAGTGAAAGGTAAAGCGGTTATTTAAAATATTGAT
  TTATAATATTCGTGACATAATATATTTATAAATAGATTATATATATATATATACATCAAAATATTATACG
  AGAACTAGAAAATATTACAGATGCAAAATAAATTAAATTTTGTAAATGTTACAGAATTAAAAATCGAAGT
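The header/sequence layout described for FASTA can be read with a few lines of Python. This is only a minimal sketch for illustration: the function name `parse_fasta` and the example records are hypothetical, and it assumes the general conventions listed above (a ‘>’ header, an ID up to the first space, sequence possibly wrapped over several lines). For real work, an established parser is a safer choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The ID is the text between '>' and the first whitespace;
            # anything after that is the optional description.
            parts = line[1:].split(None, 1)
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence may be wrapped over several lines; join them.
            records[current_id] += line
    return records


# Hypothetical two-record example (not taken from a real genome).
example = """>read_1 My sequence is pretty cool
ATTCATTAAAGCAGTTTATT
GGCTTAATGTACATC
>read_2
TTTTACTATTG
""".splitlines()

fasta_records = parse_fasta(example)
```

Note that the whole sequence is held in memory, which is fine for reads but may not be for whole genomes.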

Formats: FASTQ
FASTQ: FASTA with quality
  @unique_sequence_ID
  ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAATTTATGATAAAA
  +
  [quality string, one character per base]
- DNA sequence with quality metadata
- The header line starts with ‘@’, followed directly by an ID and an optional description (separated by a space)
- May be ‘raw’ data (straight from sequencing) or processed (trimmed)
- Variations: Sanger, Illumina, Solexa (Sanger is most common)
- Can hold 100s of millions of records
- Files can be very large: 100s of GB apiece

“Phred” quality (Q) scores
Q score   Prob. of wrong call      Accuracy
10        1 in 10 (0.1)            90%
20        1 in 100 (0.01)          99%
30        1 in 1000 (0.001)        99.9%
40        1 in 10,000 (0.0001)     99.99%

Source: Fujimoto A. et al., “Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing”, Nature Genetics (2010)
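The table follows the standard Phred relationship, Q = -10 * log10(P), where P is the probability of a wrong base call. The sketch below converts between the two and decodes a quality string; the function names are made up for this example, and the decoder assumes the Sanger / Illumina 1.8+ encoding, in which each character's ASCII code minus 33 is the Q score.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming an ASCII offset of 33."""
    return [ord(char) - 33 for char in quality_string]


# 'I' is ASCII 73, so it encodes Q40; '#' is ASCII 35, so it encodes Q2.
scores = decode_phred33("II#")
```

Older Illumina files used a different offset, so the offset of 33 should be confirmed before decoding data of unknown origin.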

Formats: FASTQ
FASTQ: FASTA with quality
  @unique_sequence_ID
  ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAATTTATGATAAAA
  +unique_sequence_ID
The ‘+’ line may optionally repeat the sequence ID.
Quality encoding: Sanger = Illumina 1.8+
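A FASTQ record can be read four lines at a time. The following is a minimal sketch rather than a production parser: `FastqRecord` and `parse_fastq` are hypothetical names, and it assumes the common unwrapped layout (header, sequence, ‘+’ line, quality) with one record per four lines.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming four lines per record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        # The ID is the text between '@' and the first whitespace.
        read_id = header[1:].split(None, 1)[0]
        yield FastqRecord(read_id, sequence, quality)


# Hypothetical record; the quality string is invented for illustration.
fastq_example = ["@read_1 optional description\n", "ACGT\n", "+\n", "IIII\n"]
fastq_records = list(parse_fastq(fastq_example))
```

Because it is a generator, the function can be fed an open file handle and will not load a very large file into memory at once.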

Feature formats
- GTF/GFF3
- SAM/BAM
- UCSC formats (BED, WIG, etc.)

Feature formats
- Used for mapping features against a particular sequence or genome assembly
- May or may not include sequence data
- The reference sequence names must match the names in a related file (possibly FASTA)
- These are version (assembly)-dependent: they are tied to a specific version (assembly/release) of a reference genome
- Not all reference genomes are represented the same way! E.g. human chromosome 1:
  - UCSC: ‘chr1’
  - Ensembl/NCBI: ‘1’
- Best practice: get these from the same source as the reference
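A quick way to catch a ‘chr1’ versus ‘1’ mismatch is to compare the first column of the annotation file with the sequence names in the reference FASTA. The sketch below is illustrative only; `unmatched_seqnames` is a hypothetical helper, and it assumes a tab-delimited feature file (GTF or GFF3) whose first column is the sequence name.

```python
from typing import Iterable, Set


def unmatched_seqnames(fasta_ids: Iterable[str],
                       annotation_lines: Iterable[str]) -> Set[str]:
    """Return sequence names used in the annotation but absent from the FASTA."""
    known = set(fasta_ids)
    missing: Set[str] = set()
    for line in annotation_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        seqname = line.split("\t", 1)[0]
        if seqname not in known:
            missing.add(seqname)
    return missing


# Hypothetical data: an Ensembl-style FASTA with a UCSC-style annotation row.
reference_ids = {"1", "2"}
annotation = [
    "##gff-version 3",
    "chr1\tsource\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "2\tsource\tgene\t50\t400\t.\t-\t.\tID=gene2",
]
mismatches = unmatched_seqnames(reference_ids, annotation)
```

A non-empty result suggests that the annotation and the reference came from different sources or releases.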

Feature formats: GTF (Gene transfer format)
- Differences in the representation of information make it distinct from GFF
- Columns, in order: chromosome ID, source, gene feature, start location, end location, score (user defined), strand, reading frame, attributes (hierarchy)
Example rows (chromosome ID, source, gene feature, and attributes columns only):
  AB Twinscan CDS gene_id "001"; transcript_id "001.1";
  AB Twinscan CDS gene_id "001"; transcript_id "001.1";
  AB Twinscan CDS gene_id "001"; transcript_id "001.1";
  AB Twinscan start_codon gene_id "001"; transcript_id "001.1";
  AB Twinscan stop_codon gene_id "001"; transcript_id "001.1";

Feature formats: GTF (Gene transfer format)
The source of a GTF file is important: an Ensembl GTF is not quite the same as a UCSC GTF.

Feature formats: GFF3 (Gene feature format, v3)
- Tab-delimited file to store genomic features, e.g. genomic intervals of genes and gene structure
- Meant to be a unified replacement for GFF/GTF (includes a specification)
- All but UCSC have started using this (UCSC prefers their own internal formats)
- Columns, in order: chromosome ID, source, gene feature, start location, end location, score (user defined), strand, phase, attributes (hierarchy)
Example rows (chromosome ID, source, gene feature, and attributes columns only):
  Chr1 amel_OGSv gene ID=GB42165
  Chr1 amel_OGSv mRNA ID=GB42165-RA;Parent=GB42165
  Chr1 amel_OGSv ’UTR Parent=GB42165-RA
  Chr1 amel_OGSv exon Parent=GB42165-RA
  Chr1 amel_OGSv exon Parent=GB42165-RA

Feature formats: GFF3 vs. GTF
- GFF3: Gene feature format
- GTF: Gene transfer format
- Always check which of the two formats is accepted by your application of choice; sometimes they cannot be swapped
GFF3 example:
  Chr1 amel_OGSv gene ID=GB42165
  Chr1 amel_OGSv mRNA ID=GB42165-RA;Parent=GB42165
  Chr1 amel_OGSv ’UTR Parent=GB42165-RA
  Chr1 amel_OGSv exon Parent=GB42165-RA
  Chr1 amel_OGSv exon Parent=GB42165-RA
GTF example:
  AB Twinscan CDS gene_id "001"; transcript_id "001.1";
  AB Twinscan CDS gene_id "001"; transcript_id "001.1";
  AB Twinscan CDS gene_id "001"; transcript_id "001.1";
  AB Twinscan start_codon gene_id "001"; transcript_id "001.1";
  AB Twinscan stop_codon gene_id "001"; transcript_id "001.1";
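The most visible difference between the two formats is the attributes column, which is why tools often cannot swap one for the other. The sketch below parses both styles as they appear in the examples above. The function names are hypothetical, and the parsers are deliberately simple: they assume no semicolons inside quoted GTF values and no URL-escaped characters in GFF3 values.

```python
from typing import Dict


def parse_gtf_attributes(field: str) -> Dict[str, str]:
    """Parse GTF-style attributes: key "value"; key "value";"""
    attributes: Dict[str, str] = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def parse_gff3_attributes(field: str) -> Dict[str, str]:
    """Parse GFF3-style attributes: key=value;key=value"""
    attributes: Dict[str, str] = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        attributes[key] = value
    return attributes


gtf_attrs = parse_gtf_attributes('gene_id "001"; transcript_id "001.1";')
gff3_attrs = parse_gff3_attributes("ID=GB42165-RA;Parent=GB42165")
```

In GTF the hierarchy is carried by `gene_id` and `transcript_id` on every row, while in GFF3 it is expressed through `ID` and `Parent` links between rows.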

What is an alignment?
Wikipedia: “a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences”
How can we store this information about millions of reads that align to our reference genome?

Formats: SAM (Sequence Alignment/Map format)
- The SAM file format stores alignment information
- Plain text
- Specification:
- Contains quality information, metadata, alignment information, sequence, etc.
- Files can be very large: many 100s of GB or more
- Normally converted into BAM to save space (and the text format is mostly useless for downstream analyses)
Example:
  @HD [format version]
  @SQ SN:chr_1 LN:
  @PG [information about program that made this]
  HWI-D00758:59:C7U2JANXX:1:1101:1398: chr_ S9M * NAGCTCTTTA #/<<BFBBFF NH:i:1 HI:i:1 AS:i:93 nM:i:2
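Each SAM alignment line has eleven mandatory tab-separated fields followed by optional tags such as `NH:i:1`. The sketch below splits one line into those fields and decodes a few bits of the FLAG value. The helper names and the example alignment line are invented for illustration; in practice a library such as pysam or the samtools command line would normally be used instead.

```python
from typing import Dict

# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one SAM alignment line (not a header line) into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


def describe_flag(flag: int) -> Dict[str, bool]:
    """Decode a few commonly used bits of the SAM FLAG field."""
    return {
        "paired": bool(flag & 0x1),
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
        "secondary": bool(flag & 0x100),
    }


def is_spliced(cigar: str) -> bool:
    """A CIGAR 'N' operation marks a skipped reference region, e.g. an intron."""
    return "N" in cigar


# Hypothetical alignment line; all values are invented for this example.
sam_line = "read_1\t16\tchr_1\t100\t255\t5M200N5M\t*\t0\t0\tNAGCTCTTTA\t#/<<BFBBFF\tNH:i:1"
sam_record = parse_sam_line(sam_line)
```

The `is_spliced` check ties back to RNA-Seq: a splice-aware aligner reports reads that span an exon junction with an `N` in the CIGAR string.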

Formats: BAM (BGZF-compressed SAM format)
- Compressed/binary version of SAM; not human readable
- Uses a specialized compression algorithm optimized for indexing and record retrieval (bgzip)
- Makes the alignment information easily accessible to downstream applications (large genome file not necessary)
- Can be unsorted, sorted by sequence name, or sorted by genome coordinates
- May be accompanied by an index file (.bai) (only if coordinate-sorted)
- Files are typically ~1/5 the size of SAM, but still very large

General Outline
4. Transcriptomic analysis methods and tools
   - Transcriptome analysis: aspects common to both assembly and differential gene expression
     - Download data
     - Quality check
     - Data alignment
   - Assembly
   - Differential gene expression
   - Choosing a method, the considerations…
   - Final thoughts and observations

Obtain sequence data
- If you are using the R.J.C. Biotechnology Center and the Biocluster, Globus is the most direct route
  - CNRG instructions
- Download data to a computer and upload to the Biocluster using an SFTP client
  - Cyberduck, WinSCP…
- Can also use Linux commands such as scp, rsync, wget, …

Globus

Quality Checks: How do my newly obtained data look?
- Check for overall data quality
- FastQC is a great tool for quality assessment
[Figure: example FastQC quality plots labeled “Good quality!” and “Poor quality!”]

Quality Checks: How do my newly obtained data look?
- In addition to the quality of each sequenced base, FastQC will give you an idea of:
  - Presence and abundance of contaminating sequences
  - Average read length
  - GC content
- NOTE: FastQC is good, but it is very strict and will not hesitate to call your dataset bad on one of the many metrics it tests the raw data for. Use logic and read the explanation of why a metric failed and whether that is acceptable.

Quality Checks: What do I do when FastQC calls my data poor?
- Poor quality at the ends can be remedied with “quality trimmers” like Trimmomatic, fastx-toolkit, etc.
- Left-over adapter sequences in the reads can be removed with “adapter trimmers” like Trimmomatic
  - Always trim adapters as a matter of routine
- We need to fix these issues so we get the best possible alignment
- After trimming, it is best to rerun the data through FastQC to check the resulting data
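To make the idea of quality trimming concrete, the sketch below removes low-quality bases from the 3' end of a read until it reaches a base at or above a threshold. It is a simplified illustration, not the algorithm of Trimmomatic or any other tool; the function name, the default threshold of Q20, and the Phred+33 offset are all assumptions made for this example.

```python
from typing import Tuple


def trim_trailing(sequence: str, quality: str,
                  min_quality: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    # Walk backwards from the last base until a good-quality base is found.
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Q40 and '#' encodes Q2 under the Phred+33 assumption.
trimmed_seq, trimmed_qual = trim_trailing("ACGTAC", "IIII##")
```

Real trimmers add further steps, such as sliding-window averages, adapter matching, and discarding reads that end up too short.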

Quality Checks
[Figure: FastQC quality plots before and after quality trimming]

Data Alignment
- We need to align the sequence data to our genome of interest
- If aligning RNA-Seq data to the genome, almost always pick a splice-aware aligner
[Figure: reads aligned to a gene in the genome, standard alignment versus splice-aware alignment]

Data Alignment
- We need to align the sequence data to our genome of interest
- If aligning RNA-Seq data to the genome, always pick a splice-aware aligner (unless it’s a bacterial genome!)
  - HISAT2, TopHat2, STAR, MapSplice, SOAPSplice, Passion, SpliceMap, RUM, ABMapper, CRAC, GSNAP, HMMSplicer …
- There are excellent aligners available that are not splice-aware. These are ideal for bacterial genomes.
  - BWA, Novoalign (not free), Bowtie2, SOAPaligner, HISAT2 (which handles both DNA and RNA alignment)

Data Alignment
Other considerations when choosing an aligner:
- How does it deal with reads that map to multiple locations?
- How does it deal with paired-end versus single-end data?
- How many mismatches will it allow between the genome and the reads?
- What assumptions does it make about my genome, and can I change these assumptions?

Data Alignment
[Figure: snapshot of aligned reads; IGV is the visualization tool used for this snapshot]

Transcriptome Assembly Overview
Two main types of assembly:
- Reference-based assembly
- De novo assembly

Transcriptome Assembly
Reference-based assembly
- Used when the genome reference sequence is known, and:
  - Transcriptome data is not available
  - Transcriptome data is available but not good enough, i.e. missing isoforms of genes, or unknown non-coding regions
  - The existing transcriptome information is for a different tissue type
- StringTie, Cufflinks, and Scripture are three reference-based transcriptome assemblers

Reference-based assembly Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

De novo assembly
- Used when very little information is available for the genome
- Often the first step in putting together information about an unknown genome
- The amount of data needed for a good de novo assembly is higher than what is needed for a reference-based assembly
- Can be used for genome annotation, once the genome is assembled
- Trinity and Trans-ABySS are examples of well-regarded transcriptome assemblers

De novo assembly (De Bruijn graph construction) Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682
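The De Bruijn graph construction named on this slide can be sketched in a few lines: each read is broken into overlapping k-mers, and every k-mer becomes an edge from its first k-1 bases to its last k-1 bases. This is a toy illustration under simplifying assumptions (error-free reads, one strand only, no coverage information); the function name and the reads are hypothetical, and real assemblers such as Trinity do considerably more.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer to the sorted (k-1)-mers that follow it in some read."""
    if k < 2:
        raise ValueError("k must be at least 2")
    edges: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            # Edge from the k-mer's prefix node to its suffix node.
            edges[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in edges.items()}


# Two hypothetical overlapping reads that together spell ACGTA.
graph = build_de_bruijn(["ACGT", "CGTA"], k=3)
```

Walking the edges from `AC` through `CG` and `GT` to `TA` reconstructs the sequence `ACGTA`, which is how overlapping reads are merged into contigs without aligning every read against every other read.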

Combined Transcriptome Assembly
- Align-then-assemble: preferred for a trusted reference genome & for filtering out pathogen RNA
- Assemble-then-align: preferred when the reference genome is not trusted or comes from a closely related species rather than the species being studied
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

How good is my assembly?
- Are all the genes I expected in the assembly?
- Do I have complete genes?
- Are the contigs assembled correctly?
- How does it look compared to a close reference?

Tools for Evaluating Assembly: using the information you have
- TransRate: evaluates an assembly using reads, paired-end information, reference genome, protein data, etc.
  - Can generate a ‘cleaned-up’ or optimized assembly based on metrics
- DETONATE (from the RSEM developers): evaluates an assembly based on read mapping and/or reference information

Tools for Evaluating Assembly: conserved gene sets
- CEGMA: from Ian Korf’s group, UC Davis; maps Core Eukaryotic Genes. Discontinued!
- BUSCO: from Evgeny Zdobnov’s group, University of Geneva
- Coverage of the conserved gene set is indicative of the quality and completeness of the assembly

Differential Gene Expression Overview
- Obtain/download sequence data
- Check the quality of the data, trim low-quality bases, and remove adapter sequence
- Align trimmed reads to the genome of interest
  - Pick an alignment tool
  - Index the genome file
  - Run the alignment after choosing the relevant parameters
  - If you are not studying human or mouse, check every parameter and confirm that it makes the correct assumptions for your genome! Otherwise, change it

Differential Gene Expression Overview
- Set up to do differential gene expression (DGE)
  - Identify read counts associated with genes
- Do you want to obtain raw read counts or normalized read counts?
  - This will depend on the statistical analysis you wish to perform downstream
  - HTSeq & featureCounts return raw read counts
    - Required for R programs like DESeq & edgeR
  - Ballgown & Cufflinks return FPKM-normalized counts for each gene
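The difference between raw and FPKM-normalized counts can be shown with a small calculation. FPKM (fragments per kilobase of transcript per million mapped fragments) divides each gene's count by its length in kilobases and by the library size in millions. The sketch below uses invented counts and gene lengths, and it takes the library size to be the sum of the counts supplied, which is an assumption; tools differ in exactly which total they use.

```python
from typing import Dict


def fpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """FPKM per gene: count * 1e9 / (gene length in bp * total counted fragments)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no fragments were counted")
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }


# Hypothetical raw counts and gene lengths for two genes.
raw_counts = {"geneA": 500, "geneB": 1500}
gene_lengths = {"geneA": 1000, "geneB": 3000}
normalized = fpkm(raw_counts, gene_lengths)
```

In this example geneB has three times the raw count of geneA but is also three times as long, so both genes end up with the same FPKM. Packages such as DESeq and edgeR expect the raw counts instead and apply their own normalization.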

Differential Gene Expression
Options for DGE analysis

DGE Statistical Analyses
- The first step is proper normalization of the data
  - Often the statistical package you use will have a normalization method that it prefers and uses exclusively (e.g. voom, FPKM, TMM (used by edgeR))
- Is your experiment a pairwise comparison?
  - Ballgown/Cufflinks, edgeR, DESeq
- Is it a more complex design?
  - edgeR, DESeq, other R/Bioconductor packages

How does one pick the right tools?

What does HPCBio use?
- Quality check: FastQC
- Trimming: Trimmomatic
- Splice-aware alignment: STAR
- Bacterial alignment: BWA or Novoalign
- Counting reads per gene: featureCounts
- DGE analysis: edgeR or DESeq
- De novo transcriptome assembly: Trinity & RSEM

You won’t be using these tools today
- They take a lot of time to learn
- We do offer a long workshop on these methods during Spring break every year
  - Check at the beginning of the year for updates

For today’s lab, you will use Tuxedo Suite
- Easier and faster to learn
  - Requires only a basic background in Linux (maybe R)
  - Available on Galaxy, an online set of tools that’s ideal for small projects and teaching!
- Great option for simple statistical models
- Can perform all steps for DGE analysis very quickly
  - Can do splice-aware & regular alignments (no need to switch software for bacteria)
  - Can perform reference-based transcriptome assembly
  - Does DGE analysis and creates statistical graphs

Why doesn’t HPCBio use Tuxedo suite?
- Very new
  - Has only been tested/optimized with human data
  - We work on a wide variety of organisms, and STAR works very well for us. No need to change yet…
- Can’t handle complex statistical models
  - Nothing more complicated than a pairwise comparison
  - We need more flexibility
- Doesn’t output raw read counts easily
  - These are needed for R packages like DESeq & edgeR, which can handle complex models

Differential Gene Expression
Options for DGE analysis (Tuxedo suite)
- Bowtie/Bowtie2 use Burrows-Wheeler indexing for aligning reads (not splice-aware)
- TopHat/TopHat2 uses either Bowtie or Bowtie2 to align reads in a splice-aware manner and aids the discovery of new splice junctions
- The Cufflinks package has 4 components; the 2 major ones are listed below:
  - Cufflinks does reference-based transcriptome assembly
  - Cuffdiff does DGE analysis in a simple pairwise comparison, and a series of pairwise comparisons in a time-course experiment
Trapnell et al., Nature Protocols, March 2012

Old Tuxedo suite
- Bowtie/Bowtie2: general aligner
- TopHat/TopHat2: splice-aware aligner
- Cufflinks suite: transcriptome assembly, gene counting, and DGE calculation for simple models
- This suite is well-known, but there are a few issues with the methods used
- No longer being supported!
New Tuxedo suite
- HISAT2: DNA & RNA aligner in one
- StringTie: transcriptome assembly & gene counting
- Ballgown: improved DGE calculation for simple models, and visualization
- This suite is very new and attempts to improve on the old suite’s issues
- However, it is not well-tested on anything but human at the moment

Old Tuxedo Suite Protocol
[Figure: protocol workflow; labels: trimmed sequence FASTQ files (.fastq), alignment file (.bam), gene annotation file (.gtf or .gff3), a single merged GTF, text output] Trapnell et al., Nature Protocols, March 2012

New Tuxedo Suite Protocol
[Figure: protocol workflow; labels: trimmed sequence FASTQ files, alignment file, gene annotation file] Pertea et al., Nature Protocols, August 2016

Note about today’s lab
- Unfortunately, the new Tuxedo suite isn’t fully available on Galaxy
  - Hopefully this will change soon
- We will use the old Tuxedo suite
  - Steps will be very similar to the new one

Final thoughts and stray observations
- Think carefully about what your experimental goals are before designing your experiment and choosing your bioinformatics tools
- When in doubt, “Google it” and ask questions
  - Biostar (Bioinformatics explained)
  - SEQanswers (the next generation sequencing community)
- Another good resource, if you are not ready to use the command line routinely, is Galaxy. It is a web-based bioinformatics portal that can also be installed locally if you have the necessary computational infrastructure.

- Today we covered how to deal with Illumina data, but you may also encounter long-read data
  - Hybrid transcriptome assemblies can be done, but are usually challenging
- R is an excellent language to learn if you are interested in performing in-depth statistical analyses for differential gene expression
  - Not within the scope of this lecture/lab section
- We do offer a long RNA-Seq workshop that covers the “HPCBio” RNA-Seq pipeline:

Documentation and Support
Online resources for RNA-Seq analysis questions:
- Software manuals
- Biostar (Bioinformatics explained)
- SEQanswers (the next generation sequencing community)
- Most tools have dedicated lists/forums
Contact us at:
See website for upcoming workshops & services:

Thank you for your attention!