TOPHAT Next-Generation Sequencing Workshop RNA-Seq Mapping


TOPHAT Next-Generation Sequencing Workshop RNA-Seq Mapping Center for Bioinformatics Hanqing Zhao 2011-07-11

Missions for RNA-Seq Mapping

Before TopHat Earlier software for aligning RNA-Seq data relied on known splice junctions and could not identify novel ones.

TOPHAT TopHat is designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. TopHat is free and available from http://tophat.cbcb.umd.edu

Patterns of alternative splicing Xing et al. 2006

Tophat pipeline Trapnell et al. 2009

Step I: mapping with Bowtie Adjustable parameters: mismatches, multireads. A read is aligned if it has no more than a few mismatches (two, by default) in the 5'-most s bases of the read and the Phred-quality-weighted Hamming distance is less than a specified threshold (70 by default). TopHat allows Bowtie to report more than one alignment for a read (default = 10).
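The two acceptance criteria above can be sketched as a small check on an ungapped alignment. This is only an illustration under stated assumptions: the function names are hypothetical, the seed length of 28 is an illustrative default, the threshold is treated as inclusive, and Bowtie's own handling of quality values is not reproduced.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_mapping_filter(read, reference, qualities,
                          seed_length=28,            # assumed illustrative value for s
                          max_seed_mismatches=2,     # "two, by default"
                          max_weighted_distance=70): # "70 by default"
    """Return True if an ungapped alignment meets both criteria.

    read, reference: equal-length strings (read bases vs. aligned genome bases).
    qualities: one Phred score per read base.
    """
    if not (len(read) == len(reference) == len(qualities)):
        raise ValueError("read, reference and qualities must have equal length")
    mismatches = [i for i, (r, g) in enumerate(zip(read, reference)) if r != g]
    # Criterion 1: few mismatches in the 5'-most seed_length bases.
    seed_mismatches = sum(1 for i in mismatches if i < seed_length)
    # Criterion 2: sum of quality scores at mismatched positions
    # (the Phred-quality-weighted Hamming distance).
    weighted_distance = sum(qualities[i] for i in mismatches)
    # Boundary handling (<= rather than <) is an assumption of this sketch.
    return (seed_mismatches <= max_seed_mismatches
            and weighted_distance <= max_weighted_distance)
```

With all qualities at 30, one mismatch costs 30 and is accepted, while three mismatches cost 90 and are rejected.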

Step II. Island assembly Use the Maq assembly module to produce pseudo-consensus exons (islands). Use the reference genome to call bases. Merge islands separated by short gaps (6 bp). Extend each island by 45 bp on both sides. Adjustable parameters: consensus call, flanking extension, gap merge.
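The gap-merging and flank-extension steps can be illustrated with plain interval arithmetic. This is a minimal sketch, not the Maq or TopHat implementation; `build_islands` and its half-open coordinate convention are assumptions made for the example.

```python
def build_islands(covered_intervals, max_gap=6, flank=45):
    """Merge covered intervals separated by short gaps, then extend each island.

    covered_intervals: iterable of (start, end) half-open genome coordinates.
    max_gap: islands whose gap is at most this many bases are merged (6 bp above).
    flank: bases added to both sides of each island (45 bp above).
    """
    merged = []
    for start, end in sorted(covered_intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close the short gap by growing the previous island.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Extend both sides; clamp at the start of the sequence.
    # Islands that overlap only after extension are not re-merged here.
    return [(max(0, s - flank), e + flank) for s, e in merged]
```

For example, `build_islands([(100, 200), (204, 300), (1000, 1100)])` merges the first two intervals (gap of 4 bp) and returns `[(55, 345), (955, 1145)]`.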

Step III. Creating candidate junction database TopHat first enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements). Next, it considers all pairings of these sites that could form canonical (GT–AG) introns between neighboring (but not necessarily adjacent) islands. By default, TopHat only examines potential introns longer than 70 bp and shorter than 20 000 bp.
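The enumeration described above can be sketched as follows. This is a simplified illustration: it scans the forward strand only, pairs every donor with every acceptor in a later island, and uses hypothetical names (`candidate_introns`, half-open coordinates) that are not taken from TopHat itself.

```python
def candidate_introns(genome, islands, min_intron=70, max_intron=20000):
    """Pair GT donors with AG acceptors from later islands (forward strand only).

    genome: reference sequence as a string.
    islands: list of (start, end) half-open coordinates.
    Returns (intron_start, intron_end) pairs, half-open, so the intron
    sequence genome[intron_start:intron_end] begins with GT and ends with AG.
    """
    donors, acceptors = [], []
    for island_index, (start, end) in enumerate(sorted(islands)):
        for i in range(start, min(end, len(genome)) - 1):
            pair = genome[i:i + 2]
            if pair == "GT":
                donors.append((i, island_index))         # intron starts at the G
            elif pair == "AG":
                acceptors.append((i + 2, island_index))  # intron ends after the G
    return [(d, a)
            for d, d_island in donors
            for a, a_island in acceptors
            if a_island > d_island and min_intron < a - d < max_intron]
```

A real implementation would also handle the reverse complement and limit how far apart the paired islands may be; the length bounds here mirror the 70 bp and 20 000 bp defaults quoted above.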

Single island junctions In order to detect such junctions without sacrificing performance and specificity, TopHat looks for introns within islands that are deeply sequenced.

Step IV. Looking for junction reads Each possible intron is checked against the IUM (initially unmapped) reads for reads that span the splice junction. The seed-and-extend strategy is used to match reads to possible splice sites. TopHat only examines the first 28 bp on the 5' end of each read by default. Defaults: k = 5 bp, s = 28 bp, giving s - 2k + 1 = 19 seeds. TopHat will miss spliced alignments of reads with mismatches in the seed region of the splice junction.
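The seed arithmetic and the exact-match seed lookup can be sketched as below. This is an assumption-laden illustration: the helper names are hypothetical, the seed is taken as the k bases before the intron joined to the k bases after it, and the extension step that follows a seed hit is omitted.

```python
def seeds_per_read(s=28, k=5):
    """Number of 2k-mers in the first s bases of a read: s - 2k + 1."""
    return s - 2 * k + 1


def junction_seed(genome, intron_start, intron_end, k=5):
    """2k-mer spanning a junction: k bases before the intron + k bases after it.

    intron_start, intron_end: half-open intron coordinates on the genome.
    """
    return genome[intron_start - k:intron_start] + genome[intron_end:intron_end + k]


def seed_hits(read, seed, s=28):
    """Offsets within the first s bases of the read where the seed matches exactly."""
    prefix = read[:s]
    width = len(seed)
    return [i for i in range(len(prefix) - width + 1) if prefix[i:i + width] == seed]
```

Because the seed must match exactly, a single mismatch inside those 2k bases is enough for the read to be missed, which is the limitation noted above.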

Step V. Filtering false junctions Wang et al. (2008) observed that 86% of the minor isoforms were expressed at at least 15% of the level of the major isoform. For each junction, the average depth of read coverage is computed for the left and right flanking regions of the junction separately. The number of alignments crossing the junction is divided by the coverage of the more deeply covered side to obtain an estimate of the minor isoform frequency. 15% is the default cut-off.
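The filter described above reduces to one division and one comparison. A minimal sketch, assuming the coverage values have already been computed; the function names are hypothetical and the cut-off is treated as inclusive.

```python
def minor_isoform_fraction(junction_alignments, left_coverage, right_coverage):
    """Junction-crossing alignments divided by the deeper flank's average coverage."""
    deeper = max(left_coverage, right_coverage)
    if deeper <= 0:
        return 0.0  # no flanking coverage: nothing to support the junction
    return junction_alignments / deeper


def keep_junction(junction_alignments, left_coverage, right_coverage,
                  min_fraction=0.15):
    """Keep the junction if the estimated fraction reaches the cut-off (15% by default)."""
    fraction = minor_isoform_fraction(junction_alignments, left_coverage, right_coverage)
    return fraction >= min_fraction
```

For instance, 3 crossing alignments next to flanks covered at 10x and 40x give 3 / 40 = 0.075 and the junction is dropped, while 9 crossing alignments give 0.225 and it is kept.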

Old Tophat’s pipeline Trapnell et al. 2009

Reads are becoming longer, and paired-end sequencing is more and more common …

Current TopHat (latest 1.3.1): Segment search, Butterfly search, Closure search, Coverage search, Gene model annotations

I. Segment search --segment-length --segment-mismatches --min-segment-intron --max-segment-intron

I. Segment search

II. Closure search --closure-search --no-closure-search --min-closure-intron --max-closure-intron Closure search is only used when TopHat is run with paired-end reads. It should only be used when the expected inner distance between mates is small (<= 50 bp).

III. Coverage search --coverage-search (disabled by default for reads 75 bp or longer) --no-coverage-search --min-coverage-intron --max-coverage-intron

IV. Butterfly search --butterfly-search Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA, that fall within the introns of your transcripts.

V. Junction annotations
-G/--GTF <GTF 2 or GFF3>
-j/--raw-juncs <.juncs file>
--no-novel-junctions Only look for reads across junctions indicated in the supplied GFF or junctions file.

Input
Reference sequence indexed with bowtie-build
FASTQ sequences
tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
Quality format? phred33 (default), --solexa-quals, --solexa1.3-quals
Paired-end? Strand-specific? Multiple files?
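The quality-format question matters because the same FASTQ character decodes to different scores under each encoding. The sketch below shows the difference, assuming the usual conventions: Phred+33 for the default, Phred+64 for `--solexa1.3-quals`, and Solexa-scaled scores with a +64 offset for `--solexa-quals`. The function and the format labels are hypothetical names for this example, not TopHat options.

```python
import math


def decode_qualities(quality_string, fmt="phred33"):
    """Decode a FASTQ quality string into Phred-scale scores.

    fmt: "phred33", "solexa1.3" (Phred+64) or "solexa" (Solexa scale, +64).
    """
    offsets = {"phred33": 33, "solexa1.3": 64, "solexa": 64}
    if fmt not in offsets:
        raise ValueError("unknown quality format: %r" % fmt)
    raw = [ord(ch) - offsets[fmt] for ch in quality_string]
    if fmt == "solexa":
        # Convert Solexa-scaled scores to Phred scores.
        return [10 * math.log10(10 ** (q / 10) + 1) for q in raw]
    return raw
```

The character `I` means Q40 under Phred+33, whereas Q40 is written `h` under Phred+64; choosing the wrong flag therefore shifts every score by 31.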

The software is optimized for reads 75 bp or longer. Mixing paired-end and single-end reads together is not supported.

Strand-specific data --library-type With a strand-specific library type (for example fr-firststrand), TopHat will treat the reads as strand-specific.

Paired-end data -r/--mate-inner-dist <int> This is the expected (mean) inner distance between mate pairs. --mate-std-dev <int> The standard deviation for the distribution on inner distances between mate pairs.
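The inner distance is the part of the fragment that neither mate covers, so it follows from the fragment length and the read length. A one-line sketch, with a hypothetical function name and assuming the fragment length excludes adapters:

```python
def mate_inner_distance(fragment_length, read_length):
    """Expected inner distance between mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length
```

For example, 50 nt paired-end reads from 200 bp fragments give `mate_inner_distance(200, 50) == 100`, the value to pass to `-r`. The result can be negative when the two mates overlap.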

Other parameters --bowtie-n (after tophat 1.3.0) -g/--max-multihits -a/--min-anchor-length (>=3, default 8) -m/--splice-mismatches (default 0) -F/--min-isoform-fraction <0.0-1.0> -p/--num-threads --keep-tmp

Output accepted_hits.bam: a list of read alignments in BAM format (the binary form of SAM). junctions.bed, insertions.bed, deletions.bed: BED tracks of the reported junctions, insertions and deletions.
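A short parser shows how intron coordinates might be recovered from junctions.bed. It assumes the usual layout of that file: 12-column BED, where each junction is two blocks whose lengths are the overhangs on either side, and the score column counts the alignments spanning the junction. The function name and the dictionary keys are hypothetical.

```python
def parse_junctions_bed(lines):
    """Extract intron coordinates and read support from junctions.bed lines."""
    junctions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "#")):
            continue  # skip the track header and blank or comment lines
        fields = line.split("\t")
        chrom, start, end, _name, score, strand = fields[:6]
        block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
        junctions.append({
            "chrom": chrom,
            # The intron lies between the two blocks (half-open, 0-based).
            "intron_start": int(start) + block_sizes[0],
            "intron_end": int(end) - block_sizes[-1],
            "strand": strand,
            "reads": int(score),
        })
    return junctions
```

It can be applied to an open file handle, for example `parse_junctions_bed(open("junctions.bed"))`, where the path is whatever output directory was given to `-o`.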

References
Trapnell C, Pachter L, Salzberg SL (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btp120
TopHat manual: http://tophat.cbcb.umd.edu/manual.html
Further reading: TopHat-Fusion

Practice time
All the files are at: ngs_vm1:
Reference sequence: REF.fa
RNA-seq data 1: SampleA.Run01, SampleA.Run02; paired-end, 50 nt at each end, Phred33 quality
RNA-seq data 2: SampleB.Run01; 75 nt at each end, strand-specific, solexa1.3-quals

Index the genome sequence bowtie-build REF.fa REF

Run tophat
tophat --version   # update is frequent; version is important
tophat             # go through all the parameters
tophat \
  -o sampleA.output \
  -r 100 \
  --mate-std-dev 30 \
  REF \
  SampleA.Run01.1.fastq,SampleA.Run02.1.fastq \
  SampleA.Run01.2.fastq,SampleA.Run02.2.fastq

Run tophat
tophat \
  -o sampleB.output \
  -r 50 \
  --mate-std-dev 30 \
  --library-type fr-firststrand \
  --solexa1.3-quals \
  REF \
  SampleB.Run01.1.fastq \
  SampleB.Run01.2.fastq
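When many samples need the same treatment, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of TopHat; it only uses options shown in the commands above and returns the argument list without running anything.

```python
def build_tophat_command(index_base, left_reads, right_reads, output_dir,
                         inner_dist, inner_dist_sd, extra_options=()):
    """Assemble a paired-end tophat command as an argument list.

    left_reads, right_reads: lists of FASTQ paths (mate 1 and mate 2 files);
    multiple runs of one sample are joined with commas, as tophat expects.
    extra_options: additional flags, e.g. ["--library-type", "fr-firststrand"].
    """
    if len(left_reads) != len(right_reads):
        raise ValueError("each mate-1 file needs a matching mate-2 file")
    command = ["tophat", "-o", output_dir,
               "-r", str(inner_dist),
               "--mate-std-dev", str(inner_dist_sd)]
    command.extend(extra_options)
    command.append(index_base)
    command.append(",".join(left_reads))
    command.append(",".join(right_reads))
    return command
```

The returned list could then be handed to `subprocess.run(command, check=True)` on a machine where `tophat` is installed.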