1
RNA-Seq and Transcriptome Analysis
Jessica R. Holmes, Research & Instructional Specialist in Life Sciences, High Performance Biological Computing (HPCBio), Roy J. Carver Biotechnology Center
2
General Outline
- Getting the RNA-Seq data: from RNA -> sequence data
- Experimental and practical considerations
- Commonly encountered file formats
- Transcriptomic analysis methods and tools
  - Transcriptome assembly
  - Differential gene expression
3
Transcriptome Sequencing (aka RNA-Seq)
Why sequence RNA?
- Differential gene expression (DGE)
  - Quantitative evaluation and comparison of transcript levels, usually between different groups
  - The vast majority of RNA-Seq is for DGE
- Transcriptome assembly
  - Build a new or improved profile of the transcribed regions (“gene models”) of the genome
  - Can then be used for DGE
- Metatranscriptomics
  - Transcriptome analysis of a community of different species (e.g., gut bacteria, hot springs, soil)
  - Gain insights into functioning and activity rather than just who is present
4
Types of RNA
- Ribosomal (rRNA)
  - Responsible for protein synthesis
  - 60% of a ribosome; up to 95% of total RNA in a cell
- Messenger (mRNA)
  - Translated into protein in the ribosome
  - 3-4% of total RNA in a cell
  - Has poly-A tails in eukaryotes
- Micro (miRNA)
  - Short (~22 nt) non-coding RNA involved in expression regulation
- Transfer (tRNA)
  - Brings specific amino acids for protein synthesis
- Others (lncRNA, shRNA, siRNA, snoRNA, etc.)
5
Experimental and Practical considerations
We usually don’t want to sequence rRNA, so we need to select for other RNA types by:
- poly-A selection (eukaryotes only)
- ribosomal depletion
- size selection
6
Illumina Sequencing Workflow
1. Library preparation: fragment DNA, repair ends, add A overhang, ligate adapters, purify
7
From RNA -> sequence data
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682
10
How do we sequence DNA?
- 1st generation: Sanger method (published 1977; first automated sequencers in the late 1980s)
- 2nd generation (“next generation”; 2005):
  - 454: pyrosequencing
  - SOLiD: sequencing by ligation
  - Illumina: sequencing by synthesis
  - Ion Torrent: ion semiconductor
  - PacBio: Single Molecule Real-Time (SMRT) sequencing, 1,000 bp
- 3rd generation (2015):
  - PacBio: SMRT, Sequel system, 20,000 bp
  - Nanopore: ion current detection
  - 10X Genomics: novel library prep for Illumina
11
Illumina – “short read” sequencing
- Rapid improvements over the years, from 36 bp to 300 bp; highest throughput at 100 bp; many different types of sequencers for various applications
- Can also “flip” a longer DNA strand and sequence from the other end to get paired-end reads
- Accuracy: 99.99%
- Biases: yes
- Most common platform for transcriptome sequencing
[Figure: single-read (100 nt) versus paired-end reads]
12
Illumina Sequencing Workflow
1. Library preparation: fragment DNA, repair ends, add A overhang, ligate adapters, purify
2. Cluster generation: hybridize to flow cell, extend hybridized template, perform bridge amplification, prepare flow cell for sequencing
3. Sequencing: perform sequencing, generate base calls
4. Data analysis: images, intensities, reads, alignments
13
Illumina Sequencing Technology Workflow
[Figure: library preparation (DNA), single molecule array, cluster growth, sequencing, image acquisition, base calling]
14
Illumina Sequencing Video
16
Experimental and Practical considerations
Experimental design
- Technical replicates
  - Illumina has low technical variation, unlike microarrays
  - Technical replicates are unnecessary
- Batch effects
  - Best to sequence everything for an experiment at the same time
  - If you are preparing the libraries, be consistent and make them simultaneously
- Biological replicates
  - These are essential for your experiment to have any statistical power
  - At least 3, but the more the better
17
Experimental and Practical considerations
Experimental design
- For transcriptome assembly
  - RNA can be pooled from various sources to ensure the most robust transcriptome
  - Pooling can also be done after sequencing, but before assembly
- For differential gene expression
  - Pooling RNA from multiple biological replicates is usually not advisable
18
Experimental and Practical considerations
Poly(A) enrichment or ribosomal RNA depletion?
- Unless you are interested in rRNA, it is always best to remove it
  - Remember that it makes up as much as ~95% of the total RNA in a cell
- Use poly(A) enrichment if you are only interested in coding RNA
  - This is very common, but do not use it if you are interested in non-coding RNA as well
- For metatranscriptomics, you should also remove host sequences, but this can be done with bioinformatics tools downstream
19
Experimental and Practical considerations
Single-end (SE) or paired-end (PE)?
- SE is most common for DGE analysis
- PE is best for assemblies, for isoform differentiation, and for paralogous & orthologous gene differentiation (i.e. high-ploidy genomes & metatranscriptomes)
Single-end read
  Read1 ATGTTCCATAAGC…
Paired-end reads
  Read1 ATGTTCCATAAGC…
  Read2 CCGTAATGGCATG…
20
Experimental and Practical considerations
Stranded?
- Most RNA-Seq library preparation kits produce stranded libraries
  - Exception: single cell prep kits may be unstranded
- Can identify which strand of DNA the RNA was transcribed from
- Strandedness is advisable for all applications
- Three types of libraries
  - Unstranded: the strand of DNA used to transcribe the reads is unknown
  - Reverse: reads were transcribed from the strand with complementary sequence to the reads
  - Forward: reads were transcribed from the strand that has a sequence identical to the reads
21
From RNA -> sequence data
[Figure labels: Uracil DNA Glycosylase] Borodina T., Methods in Enzymology (2011) 500:79–98
23
File formats: a brief note
- Sequence formats: FASTA, FASTQ
- Feature formats: GFF, GTF
- Alignment formats: SAM, BAM
24
Formats: FASTA
>unique_sequence_ID My sequence is pretty cool
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
- Deceptively simple format (i.e. there is no standard)
- However, in general:
  - The header line starts with ‘>’, followed directly by an ID
  - … and an optional description (separated by a space)
- Files can be fairly large (whole genomes)
- Any residue type (DNA, RNA, protein), but simple alphabet
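The header convention above (ID directly after ‘>’, then an optional description after the first space) can be sketched in a few lines of Python. This is only an illustration: the function name `parse_fasta` is made up for this example, and real projects would normally rely on an established library.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (seq_id, description, sequence) for each FASTA record."""
    seq_id, description, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # ID is everything up to the first space; the rest is description.
            seq_id, _, description = line[1:].partition(" ")
            chunks = []
        elif seq_id is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


example = [
    ">unique_sequence_ID My sequence is pretty cool",
    "ATTCATTAAAGCAGTTTATTGGCTTAATGT",
    "ACATCAGTGAAATCATAAATGCTAAAAA",
]
records = list(parse_fasta(example))
```

Because the sequence is joined across lines, the same function handles a single short read and a chromosome wrapped over thousands of lines.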
25
Formats: FASTA
E.g. a read:
>unique_sequence_ID
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAATTTATGATAAAA
E.g. a chromosome:
>Group10 gi| |ref|NC_ | Amel_4.5, whole genome shotgun sequence
TAATTTATATATCTATTTTTTTTATTAAAAAATTTATATTTTTGTTAAAATTTTATTTGATTAGAAATAT
TTTTACTATTGTTCATTAATCGTTAATTAAAGATAGCACAGCACATGTAAGAATTCTAGGTCATGCGAAA
TTAAAAATTAAAAATATTCATATTTCTATAATAATTAAATTATTGTTTTAATTTAAGTAAAAAAATTTCT
AAGAAATCAAAAATTTGTTGTAATATTGAAACAAAATTTTGTTGTCTGCTTTTTATAGTAACTAATAAAT
ATTTAATAAAAAATTACTTTATTTAATATTTTATAATAAATCAAATTGTCCAATTTGAAATTTATTTTAT
CACTAAAAATATCTTTATTATAGTCAATATTTTTTGTTAGGTTTAAATAATTGTTAAAATTAGAAAATGA
TCGATATTTTCAAATAGTACGTTTAACTAATACTTAAGTGAAAGGTAAAGCGGTTATTTAAAATATTGAT
TTATAATATTCGTGACATAATATATTTATAAATAGATTATATATATATATATACATCAAAATATTATACG
AGAACTAGAAAATATTACAGATGCAAAATAAATTAAATTTTGTAAATGTTACAGAATTAAAAATCGAAGT
26
Formats: FASTQ
FASTQ: FASTA with quality
@unique_sequence_ID
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAATTTATGATAAAA
+
- DNA sequence with quality metadata
- The header line starts with ‘@’, followed directly by an ID and an optional description (separated by a space)
- May be ‘raw’ data (straight from sequencing) or processed (trimmed)
- Variations: Sanger, Illumina, Solexa (Sanger is most common)
- Can hold 100’s of millions of records
- Files can be very large: 100’s of GB apiece
27
“Phred” quality (Q) scores
Q score   Prob. of wrong call       Accuracy
10        1 in 10 (0.1)             90%
20        1 in 100 (0.01)           99%
30        1 in 1,000 (0.001)        99.9%
40        1 in 10,000 (0.0001)      99.99%
Source: Fujimoto A. et al., “Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing”, Nature Genetics, 2010
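The table follows the standard Phred relationship, Q = -10 * log10(P), where P is the probability of a wrong base call. A minimal Python sketch of the conversion in both directions (function names are illustrative only):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table: Q score, error probability, accuracy.
rows = [(q, phred_to_error_prob(q), 1 - phred_to_error_prob(q))
        for q in (10, 20, 30, 40)]
```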
28
Formats: FASTQ
FASTQ: FASTA with quality
@unique_sequence_ID
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAATTTATGATAAAA
+unique_sequence_ID
Sanger = Illumina 1.8+
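In the Sanger / Illumina 1.8+ variant, each quality score is stored as a single ASCII character whose code is the Phred score plus 33. The sketch below reads 4-line FASTQ records and decodes the quality string under that assumption. The function names and the tiny record (read ID, bases, and quality characters) are invented for illustration and are not taken from the slides.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        read_id = header[1:].split(" ", 1)[0]
        yield read_id, seq, qual


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (offset 33 = Sanger/Illumina 1.8+)."""
    return [ord(ch) - offset for ch in qual]


# Hypothetical record, for illustration only.
record_lines = ["@read1 made-up example", "ATTCA", "+", "II5#!"]
read_id, seq, qual = next(parse_fastq(record_lines))
scores = decode_quality(qual)  # [40, 40, 20, 2, 0]
```

Older Illumina and Solexa files use a different offset, so it is worth confirming the encoding before trusting decoded scores.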
29
Feature formats
- GTF/GFF3
- SAM/BAM
- UCSC formats (BED, WIG, etc.)
30
Feature formats
- Used for mapping features against a particular sequence or genome assembly
- May or may not include sequence data
- The reference sequence names must match the names in a related file (possibly FASTA)
- These are version (assembly)-dependent: they are tied to a specific version (assembly/release) of a reference genome
- Not all reference genomes are represented the same way! E.g. human chromosome 1:
  - UCSC: ‘chr1’
  - Ensembl/NCBI: ‘1’
- Best practice: get these from the same source as the reference
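A quick way to catch the ‘chr1’ versus ‘1’ problem before alignment or counting is to compare the sequence names in the FASTA with the first column of the annotation file. The following is a rough sketch; the function names and the two small inputs are hypothetical.

```python
from typing import Iterable, List, Set


def fasta_sequence_names(lines: Iterable[str]) -> Set[str]:
    """Collect the IDs from FASTA header lines ('>' up to the first space)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def annotation_sequence_names(lines: Iterable[str]) -> Set[str]:
    """Collect column 1 of a tab-delimited GTF/GFF3 file, skipping comments."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def names_missing_from_reference(fasta_lines, annotation_lines) -> List[str]:
    """Annotation sequence names that do not occur in the reference FASTA."""
    return sorted(
        annotation_sequence_names(annotation_lines)
        - fasta_sequence_names(fasta_lines)
    )


# Hypothetical inputs: UCSC-style reference, Ensembl-style annotation.
fasta = [">chr1 example", "ACGT", ">chr2", "GGCC"]
gtf = ['1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";']
missing = names_missing_from_reference(fasta, gtf)  # ['1']
```

An empty result does not prove the two files describe the same assembly; it only rules out the naming mismatch.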
32
Feature formats: GTF
Gene transfer format
- Differences in representation of information make it distinct from GFF
- Source of GTF is important: Ensembl GTF is not quite the same as UCSC GTF
AB Twinscan CDS gene_id "001"; transcript_id "001.1";
AB Twinscan CDS gene_id "001"; transcript_id "001.1";
AB Twinscan CDS gene_id "001"; transcript_id "001.1";
AB Twinscan start_codon gene_id "001"; transcript_id "001.1";
AB Twinscan stop_codon gene_id "001"; transcript_id "001.1";
Columns, in order: chromosome ID, source, gene feature, start location, end location, score (user defined), strand, reading frame, attributes (hierarchy)
33
Feature formats: GFF3
Gene feature format (v3)
- Tab-delimited file to store genomic features, e.g. genomic intervals of genes and gene structure
- Meant to be a unified replacement for GFF/GTF (includes a specification)
- All but UCSC have started using this (UCSC prefers their own internal formats)
Chr1 amel_OGSv gene ID=GB42165
Chr1 amel_OGSv mRNA ID=GB42165-RA;Parent=GB42165
Chr1 amel_OGSv ’UTR Parent=GB42165-RA
Chr1 amel_OGSv exon Parent=GB42165-RA
Chr1 amel_OGSv exon Parent=GB42165-RA
Columns, in order: chromosome ID, source, gene feature, start location, end location, score (user defined), strand, phase, attributes (hierarchy)
34
Feature formats: GFF3 vs. GTF
- GFF3: Gene feature format
- GTF: Gene transfer format
- Always check which of the two formats is accepted by your application of choice; sometimes they cannot be swapped
GFF3:
Chr1 amel_OGSv gene ID=GB42165
Chr1 amel_OGSv mRNA ID=GB42165-RA;Parent=GB42165
Chr1 amel_OGSv ’UTR Parent=GB42165-RA
Chr1 amel_OGSv exon Parent=GB42165-RA
Chr1 amel_OGSv exon Parent=GB42165-RA
GTF:
AB Twinscan CDS gene_id "001"; transcript_id "001.1";
AB Twinscan CDS gene_id "001"; transcript_id "001.1";
AB Twinscan CDS gene_id "001"; transcript_id "001.1";
AB Twinscan start_codon gene_id "001"; transcript_id "001.1";
AB Twinscan stop_codon gene_id "001"; transcript_id "001.1";
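One reason the two formats cannot always be swapped is the last column: GTF writes attributes as `key "value";` pairs, while GFF3 writes `key=value` pairs separated by semicolons and links features through `ID`/`Parent`. A minimal sketch of parsing each style follows. The function names are illustrative, and the GTF parser only handles quoted values, which is a simplification.

```python
import re
from typing import Dict
from urllib.parse import unquote

# GTF style: key "value"; key "value";
_GTF_PAIR = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(text: str) -> Dict[str, str]:
    """Parse a GTF attribute column (quoted values only, for simplicity)."""
    return {key: value for key, value in _GTF_PAIR.findall(text)}


def parse_gff3_attributes(text: str) -> Dict[str, str]:
    """Parse a GFF3 attribute column of key=value pairs separated by ';'."""
    attrs = {}
    for item in text.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        # GFF3 escapes reserved characters with URL-style percent-encoding.
        attrs[key.strip()] = unquote(value.strip())
    return attrs


gtf_attrs = parse_gtf_attributes('gene_id "001"; transcript_id "001.1";')
gff3_attrs = parse_gff3_attributes("ID=GB42165-RA;Parent=GB42165")
```

In the GTF row, the gene and transcript are named directly on every line; in the GFF3 row, the mRNA points to its gene only through `Parent`, so the hierarchy has to be rebuilt by following those links.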
35
What is an alignment?
Wikipedia: “a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences”
How can we store this information about millions of reads that align to our reference genome?
36
Formats: SAM
SAM: Sequence Alignment/Map format
- SAM file format stores alignment information
- Plain text
- Specification:
- Contains quality information, meta data, alignment information, sequence, etc.
- Files can be very large: many 100’s of GB or more
- Normally converted into BAM to save space (and text format is mostly useless for downstream analyses)
@HD [format version]
@SQ SN:chr_1 LN:
@PG [information about program that made this]
HWI-D00758:59:C7U2JANXX:1:1101:1398: chr_ S9M * NAGCTCTTTA #/<<BFBBFF NH:i:1 HI:i:1 AS:i:93 nM:i:2
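Each alignment line in a SAM file has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), followed by optional `TAG:TYPE:VALUE` fields such as `NH:i:1`. The sketch below splits one line into those parts. The example line is hypothetical: it reuses the sequence, quality string, and tags shown above, but the read name, flag, position, and CIGAR are invented.

```python
from typing import Any, Dict

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one SAM alignment line into mandatory fields and optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record: Dict[str, Any] = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    tags: Dict[str, Any] = {}
    for field in cols[len(SAM_FIELDS):]:
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


def is_unmapped(record: Dict[str, Any]) -> bool:
    """Bit 0x4 of FLAG marks a read that did not align."""
    return bool(record["FLAG"] & 0x4)


# Hypothetical alignment line, for illustration only.
line = "\t".join(["read1", "0", "chr_1", "100", "255", "10M", "*", "0", "0",
                  "NAGCTCTTTA", "#/<<BFBBFF",
                  "NH:i:1", "HI:i:1", "AS:i:93", "nM:i:2"])
rec = parse_sam_line(line)
```

For real data, dedicated tools such as samtools are the usual route; this only shows how the columns are laid out.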
37
Formats: BAM
BAM: BGZF compressed SAM format
- Compressed/binary version of SAM; it is not human readable
- Uses a specialized compression format (BGZF, as written by bgzip) optimized for indexing and record retrieval
- Makes the alignment information easily accessible to downstream applications (large genome file not necessary)
- Can be unsorted, sorted by read name, or sorted by genome coordinates
- May be accompanied by an index file (.bai) (only if coordinate sorted)
- Files are typically ~1/5 the size of SAM, but still very large
38
General Outline
4. Transcriptomic analysis methods and tools
- Transcriptome analysis: aspects common to both assembly and differential gene expression
  - Download data
  - Quality check
  - Data alignment
- Assembly
- Differential gene expression
- Choosing a method, the considerations…
- Final thoughts and observations
39
Obtain sequence data
If you are using the R.J.C. Biotechnology Center and the Biocluster:
- Globus is the most direct route (CNRG instructions)
- Download data to a computer and upload to the Biocluster using an SFTP client (Cyberduck, WinSCP…)
- Can also use Linux commands such as scp, rsync, wget, …
40
Globus
41
Quality Checks How do my newly obtained data look?
- Check for overall data quality
- FastQC is a great tool that enables the quality assessment
[Figure: example plots of good-quality and poor-quality data]
42
Quality Checks How do my newly obtained data look?
- Check for overall data quality
- FastQC is a great tool that enables the quality assessment
- In addition to the quality of each sequenced base, it will give you an idea of:
  - Presence and abundance of contaminating sequences
  - Average read length
  - GC content
NOTE: FastQC is good, but it is very strict and will not hesitate to call your dataset bad on one of the many metrics it tests the raw data for. Use logic, and read the explanation of why a metric failed and whether it is acceptable.
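To make two of these metrics concrete, the sketch below computes average read length and overall GC content for a handful of reads. It is not how FastQC is implemented, only an illustration of what the numbers mean; the function names and example reads are made up.

```python
from typing import Dict, Iterable


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarize_reads(reads: Iterable[str]) -> Dict[str, float]:
    """Number of reads, mean read length, and GC fraction over all bases."""
    n_reads = total_bases = total_gc = 0
    for read in reads:
        read = read.upper()
        n_reads += 1
        total_bases += len(read)
        total_gc += read.count("G") + read.count("C")
    return {
        "n_reads": n_reads,
        "mean_length": total_bases / n_reads if n_reads else 0.0,
        "gc_fraction": total_gc / total_bases if total_bases else 0.0,
    }


# Hypothetical reads, for illustration only.
summary = summarize_reads(["ATGC", "GGCC", "AT"])
```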
43
Quality Checks What do I do when FastQC calls my data poor?
- Poor quality at the ends can be remedied with “quality trimmers” like Trimmomatic, fastx-toolkit, etc.
- Left-over adapter sequences in the reads can be removed with “adapter trimmers” like Trimmomatic
  - Always trim adapters as a matter of routine
- We need to amend these issues so we get the best possible alignment
- After trimming, it is best to rerun the data through FastQC to check the resulting data
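The idea behind quality trimming can be shown with a deliberately simple rule: remove bases from the 3’ end until one reaches a minimum Phred score. This is not the algorithm Trimmomatic or fastx-toolkit use (they offer sliding windows and other strategies); the function name, threshold, and example read are assumptions for illustration, and Phred+33 encoding is assumed.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20,
                offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q.

    Assumes Phred+33 (Sanger / Illumina 1.8+) quality encoding.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read with scores 40, 40, 20, 2, 0: the last two bases go.
trimmed_seq, trimmed_qual = trim_3prime("ATTCA", "II5#!", min_q=20)
```

A read that becomes very short after trimming is usually discarded as well, which is why trimmers also take a minimum-length setting.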
44
Transcriptome Analysis
Quality Checks
[Figure: quality plots before quality trimming and after quality trimming]
45
Transcriptome Analysis
Data Alignment
- We need to align the sequence data to our genome of interest
- If aligning RNA-Seq data to the genome, almost always pick a splice-aware aligner
[Figure: genome, gene, and reads; standard alignment versus splice-aware alignment]
46
Transcriptome Analysis
Data Alignment
- We need to align the sequence data to our genome of interest
- If aligning RNA-Seq data to the genome, always pick a splice-aware aligner (unless it’s a bacterial genome!)
  - HiSat2, TopHat2, STAR, MapSplice, SOAPSplice, Passion, SpliceMap, RUM, ABMapper, CRAC, GSNAP, HMMSplicer …
- There are excellent aligners available that are not splice-aware. These are ideal for bacterial genomes.
  - BWA, Novoalign (not free), Bowtie2, SOAPaligner, HiSat2
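In SAM/BAM output, a splice-aware aligner records the intron a read skips over with the CIGAR operation `N`. The sketch below walks a CIGAR string and reports the skipped reference intervals. The function name and the example alignment (position 100, CIGAR `50M1000N50M`) are made up for illustration.

```python
import re
from typing import List, Tuple

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def introns_from_cigar(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) reference intervals skipped by 'N'."""
    introns = []
    ref_pos = pos
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            introns.append((ref_pos, ref_pos + n - 1))
        if op in REF_CONSUMING:
            ref_pos += n
    return introns


# Hypothetical spliced read: 50 bases match, 1000 bases skipped, 50 match.
spliced = introns_from_cigar(100, "50M1000N50M")   # [(150, 1149)]
unspliced = introns_from_cigar(100, "100M")        # []
```

An aligner that is not splice-aware has no way to express the 1000-base gap, so such a read is either clipped, forced into a poor alignment, or left unmapped.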
47
Transcriptome Analysis
Data Alignment
Other considerations when choosing an aligner:
- How does it deal with reads that map to multiple locations?
- How does it deal with paired-end versus single-end data?
- How many mismatches will it allow between the genome and the reads?
- What assumptions does it make about my genome, and can I change these assumptions?
48
Transcriptome Analysis
Data Alignment
[Figure: snapshot of aligned reads; IGV is the visualization tool used for this snapshot]
50
Transcriptome Assembly Overview
Two main types of assembly:
- Reference-based assembly
- De novo assembly
51
Transcriptome Assembly
Reference-based assembly
- Used when the genome reference sequence is known, and:
  - Transcriptome data is not available
  - Transcriptome data is available but not good enough, i.e. missing isoforms of genes, or unknown non-coding regions
  - The existing transcriptome information is for a different tissue type
- StringTie, Cufflinks and Scripture are three reference-based transcriptome assemblers
52
Transcriptome Assembly
Reference-based assembly Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682
56
Transcriptome Assembly
De novo assembly
- Used when very little information is available for the genome
- Often the first step in putting together information about an unknown genome
- The amount of data needed for a good de novo assembly is higher than what is needed for a reference-based assembly
- Can be used for genome annotation, once the genome is assembled
- Trinity and Trans-ABySS are examples of well-regarded transcriptome assemblers
57
Transcriptome Assembly
De novo assembly (De Bruijn graph construction) Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682
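As a toy illustration of De Bruijn graph construction: every k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases, and a contig is read off by walking through the graph. This sketch ignores sequencing errors, coverage, reverse complements, and isoforms, all of which real assemblers such as Trinity must handle. The function names and the two reads are invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unbranched(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, visited = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:
            break  # stop on cycles
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Two hypothetical overlapping reads.
graph = build_de_bruijn(["ATGGC", "TGGCA"], k=3)
contig = walk_unbranched(graph, "AT")  # "ATGGCA"
```

The two reads overlap by four bases, and the walk merges them into a single six-base contig without ever comparing the reads to each other directly.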
60
Combined Transcriptome Assembly
- Align-then-assemble: preferred for a trusted reference genome & for filtering out pathogen RNA
- Assemble-then-align: preferred when the reference genome is not trusted or comes from a closely related species rather than the species being studied
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682
61
How good is my assembly?
- Are all the genes I expected in the assembly?
- Do I have complete genes?
- Are the contigs assembled correctly?
- How does it look compared to a close reference?
62
Tools for Evaluating Assembly: using the information you have
- TransRate: evaluates the assembly using reads, paired-end information, reference genome, protein data, etc.
  - Can generate a ‘cleaned-up’ or optimized assembly based on metrics
- DETONATE (from the RSEM developers): evaluates the assembly based on read mapping and/or reference information
63
Tools for Evaluating Assembly: conserved gene sets
- CEGMA: from Ian Korf’s group, UC Davis
  - Mapping Core Eukaryotic Genes. Discontinued!
- BUSCO: from Evgeny Zdobnov’s group, University of Geneva
- Coverage is indicative of quality and completeness of the assembly
65
Differential Gene Expression Overview
- Obtain/download sequence data
- Check quality of data, trim low-quality bases, and remove adapter sequence
- Align trimmed reads to genome of interest
  - Pick alignment tool
  - Index genome file
  - Run alignment after choosing the relevant parameters
- If you are not studying human or mouse, check every parameter and confirm that it makes the correct assumptions for your genome! Otherwise, change it
66
Differential Gene Expression overview
Set up to do differential gene expression (DGE)
- Identify read counts associated with genes
- Do you want to obtain raw read counts or normalized read counts?
  - This will depend on the statistical analysis you wish to perform downstream
- HTSeq & featureCounts return raw read counts
  - Required for R programs like DESeq & edgeR
- Ballgown & Cufflinks return FPKM-normalized counts for each gene
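To show how a raw count relates to an FPKM value: FPKM (fragments per kilobase of transcript per million mapped fragments) divides the count by the gene length in kilobases and by the sequencing depth in millions. The sketch below uses the textbook formula, not the exact estimation procedure inside Cufflinks or Ballgown; the function name and numbers are illustrative.

```python
def fpkm(fragment_count: float, gene_length_bp: float,
         total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2,000 bp long, 10 million mapped fragments.
value = fpkm(500, 2000, 10_000_000)  # 25.0
```

Packages such as DESeq and edgeR expect the raw counts instead, because their statistical models do their own normalization.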
67
Differential Gene Expression
Options for DGE analysis
DGE Statistical Analyses
- The first step is proper normalization of the data
  - Often the statistical package you use will have a normalization method that it prefers and uses exclusively (e.g. Voom, FPKM, TMM (used by edgeR))
- Is your experiment a pairwise comparison?
  - Ballgown/Cufflinks, edgeR, DESeq
- Is it a more complex design?
  - edgeR, DESeq, other R/Bioconductor packages
72
Transcriptome Analysis
How does one pick the right tools?
73
What does HPCBio use?
- Quality check: FastQC
- Trimming: Trimmomatic
- Splice-aware alignment: STAR
- Bacterial alignment: BWA or Novoalign
- Counting reads per gene: featureCounts
- DGE analysis: edgeR or DESeq
- De novo transcriptome assembly: Trinity & RSEM
74
You won’t be using these tools today
- They take a lot of time to learn
- We do offer a long workshop on these methods during Spring break every year
  - Check at the beginning of the year for updates
75
For today’s lab, you will use Tuxedo Suite
- Easier and faster to learn
  - Requires only a basic background in Linux (maybe R)
  - Available on Galaxy, an online set of tools that’s ideal for small projects and teaching!
- Great option for simple statistical models
- Can perform all steps for DGE analysis very quickly
  - Can do splice-aware & regular alignments (no need to switch software for bacteria)
  - Can perform reference-based transcriptome assembly
  - Does DGE analysis and creates statistical graphs
76
Why doesn’t HPCBio use Tuxedo suite?
- Very new. Has only been tested/optimized with human data
  - We work on a wide variety of organisms, and STAR works very well for us. No need to change yet…
- Can’t handle complex statistical models
  - Nothing more complicated than a pairwise comparison
  - We need more flexibility
- Doesn’t output raw read counts easily
  - These are needed for R packages like DESeq & edgeR, which can handle complex models
77
Differential Gene Expression
Options for DGE analysis (Tuxedo suite)
- Bowtie/Bowtie2 use Burrows-Wheeler indexing for aligning reads (not splice-aware)
- Tophat/Tophat2 utilize either Bowtie or Bowtie2 to align reads in a splice-aware manner and aid the discovery of new splice junctions
- The Cufflinks package has 4 components; the 2 major ones are listed below:
  - Cufflinks does reference-based transcriptome assembly
  - Cuffdiff does DGE analysis in a simple pairwise comparison, and a series of pairwise comparisons in a time-course experiment
Trapnell et al., Nature Protocols, March 2012
78
Old Tuxedo suite
- Bowtie/Bowtie2: general aligner
- Tophat/Tophat2: splice-aware aligner
- Cufflinks suite: transcriptome assembly, gene counting, and DGE calculation for simple models
- This suite is well-known, but there are a few issues with the methods used
- No longer being supported!
NEW Tuxedo suite
- HISAT2: DNA & RNA aligner in one
- StringTie: transcriptome assembly & gene counting
- Ballgown: improved DGE calculation for simple models, and visualization
- This suite is very new and attempts to improve on the old suite’s issues
- However, not well-tested on anything but human at the moment
79
Old Tuxedo Suite Protocol
[Figure: protocol flowchart. File labels: trimmed sequence FASTQ files (.fastq), alignment file (.bam), gene annotation file (.gtf or .gff3), a single merged GTF, text] Trapnell et al., Nature Protocols, March 2012
80
New Tuxedo Suite Protocol
[Figure: protocol flowchart. File labels: trimmed sequence FASTQ files, alignment file, gene annotation file] Pertea, et al., Nature Protocols, August 2016
81
Note about today’s lab
- Unfortunately, the new Tuxedo suite isn’t fully available on Galaxy
  - Hopefully this will change soon
- We will use the old Tuxedo suite
  - Steps will be very similar to the new one
85
Final thoughts and stray observations
- Think carefully about what your experimental goals are before designing your experiment and choosing your bioinformatics tools
- When in doubt, “Google it” and ask questions
  - Biostar (Bioinformatics explained)
  - SEQanswers (the next generation sequencing community)
- Another good resource, if you are not ready to use the command line routinely, is Galaxy. It is a web-based bioinformatics portal that can be locally installed, if you have the necessary computational infrastructure.
87
Final thoughts and stray observations
- Today we covered how to deal with Illumina data, but you may encounter long-read data as well
  - Hybrid transcriptome assemblies can be done, but are usually challenging
- R is an excellent language to learn if you are interested in performing in-depth statistical analyses for differential gene expression analysis
  - Not within the scope of this lecture/lab section
  - We do offer a long RNA-Seq workshop that covers the “HPCBio” RNA-Seq pipeline:
88
Documentation and Support
Online resources for RNA-Seq analysis questions:
- Software manuals
- Biostar (Bioinformatics explained)
- SEQanswers (the next generation sequencing community)
- Most tools have dedicated lists/forums
Thanks
Contact us at:
See website for upcoming workshops & services:
89
Thank you for your attention!