TOPHAT Next-Generation Sequencing Workshop RNA-Seq Mapping

1 TOPHAT Next-Generation Sequencing Workshop RNA-Seq Mapping
Center for Bioinformatics Hanqing Zhao

2 Missions for RNA-Seq Mapping

3 Before Tophat Previous software for aligning RNA-Seq data relies on known splice junctions and cannot identify novel ones.

4 TOPHAT Tophat is designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. Tophat is free and available from

5 Patterns of alternative splicing
Xing et al. 2006

6 Tophat pipeline Trapnell et al. 2009

7 Step I: mapping with Bowtie
Adjustable parameters: mismatches, multireads. An alignment is reported when there are no more than a few mismatches (two, by default) in the 5'-most s bases of the read and the Phred-quality-weighted Hamming distance is less than a specified threshold (70 by default). TopHat allows Bowtie to report more than one alignment for a read (default = 10).
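
The two criteria on this slide can be sketched in a few lines of Python. This is only an illustration for an ungapped read/reference pair of equal length, not Bowtie's actual code; the function name and argument names are hypothetical, and the weighted distance is taken to be the sum of Phred qualities at mismatched positions.

```python
def passes_alignment_criteria(read, ref, quals, seed_len=28,
                              max_seed_mismatches=2, quality_threshold=70):
    """Check an ungapped alignment against the two criteria on the slide.

    read, ref: equal-length base strings; quals: Phred scores per read base.
    All names and the strict '<' comparison are assumptions for illustration.
    """
    if not (len(read) == len(ref) == len(quals)):
        raise ValueError("read, ref and quals must have the same length")
    mismatches = [i for i, (r, g) in enumerate(zip(read, ref)) if r != g]
    # Criterion 1: few mismatches in the 5'-most seed_len bases.
    seed_mismatches = sum(1 for i in mismatches if i < seed_len)
    # Criterion 2: quality-weighted Hamming distance below the threshold.
    weighted_distance = sum(quals[i] for i in mismatches)
    return (seed_mismatches <= max_seed_mismatches
            and weighted_distance < quality_threshold)
```

Two high-quality mismatches (for example Phred 40 each) already exceed the threshold of 70, while two lower-quality mismatches can still pass.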

8 Step II. island assembly
Use the Maq assembly module to produce pseudo-consensus exons (islands). Use the reference genome to call bases. Merge exon gaps (6bp). Extend each island by 45bp on both sides. Adjustable parameters: consensus call, flanking extension, gap merge.
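
The merge-then-extend step can be pictured as simple interval arithmetic. The sketch below assumes islands are half-open (start, end) coordinates on one chromosome and that "merge exon gaps (6bp)" means joining islands separated by at most 6 bases; both are assumptions for illustration, and the function name is hypothetical.

```python
def merge_and_extend(islands, max_gap=6, flank=45):
    """Merge islands separated by small gaps, then extend each by a flank.

    islands: list of (start, end) half-open intervals.
    The gap rule (<= max_gap) is an assumed reading of the slide.
    """
    merged = []
    for start, end in sorted(islands):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous island: join them.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Extend both sides, never going below coordinate 0.
    return [(max(0, s - flank), e + flank) for s, e in merged]
```

For example, `merge_and_extend([(100, 150), (154, 200), (400, 450)])` joins the first two islands (gap of 4) and leaves the third separate before adding the 45bp flanks.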

9 Step III. Creating candidate junction database
TopHat first enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements). Next, it considers all pairings of these sites that could form canonical (GT–AG) introns between neighboring (but not necessarily adjacent) islands. By default, TopHat only examines potential introns longer than 70 bp and shorter than bp.
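
A rough sketch of the enumeration described on this slide, forward strand only. It scans islands for GT (donor) and AG (acceptor) dinucleotides and pairs a donor with an acceptor in any later island when the implied intron is long enough. The function name is hypothetical, islands are assumed to be sorted half-open intervals, and the upper length bound is left as a caller-supplied argument because the slide text does not show its value.

```python
def candidate_introns(genome, islands, min_intron=70, max_intron=None):
    """Enumerate GT-AG intron candidates between different islands.

    genome: reference sequence string; islands: sorted (start, end) intervals.
    Returns (intron_start, intron_end) half-open coordinates.
    Reverse-complement (CT-AC) sites are omitted to keep the sketch short.
    """
    donors, acceptors = [], []
    for idx, (start, end) in enumerate(islands):
        for i in range(start, end - 1):
            dinucleotide = genome[i:i + 2]
            if dinucleotide == "GT":
                donors.append((idx, i))         # intron would begin at the G
            elif dinucleotide == "AG":
                acceptors.append((idx, i + 2))  # intron would end after the G
    introns = []
    for d_idx, d_pos in donors:
        for a_idx, a_pos in acceptors:
            if a_idx <= d_idx:
                continue  # only pair a donor with a later island
            length = a_pos - d_pos
            if length <= min_intron:
                continue
            if max_intron is not None and length >= max_intron:
                continue
            introns.append((d_pos, a_pos))
    return introns
```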

10 Single island junctions
In order to detect such junctions without sacrificing performance and specificity, TopHat looks for introns within islands that are deeply sequenced.

11 Step IV. Looking for junction reads
Each possible intron is checked against the IUM (initially unmapped) reads for reads that span the splice junction. The seed-and-extend strategy is used to match reads to possible splice sites. TopHat only examines the first 28 bp on the 5' end of each read by default. Defaults: k = 5bp, s = 28bp, giving s-2k+1 seeds. TopHat will miss spliced alignments to reads with mismatches in the seed region of the splice junction.
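
One way to read "s-2k+1 seeds" is as the number of 2k-long windows that fit in the first s bases of a read. The sketch below follows that reading; it is an assumption about the slide's shorthand, and the function name is hypothetical.

```python
def seed_windows(read, k=5, s=28):
    """Return every 2k-mer window within the first s bases of a read.

    With the slide's defaults (k=5, s=28) a read of at least 28 bases
    yields s - 2k + 1 = 19 windows of 10 bases each.
    """
    region = read[:s]
    width = 2 * k
    return [region[i:i + width] for i in range(len(region) - width + 1)]
```

A splice site that falls inside one of these windows can be looked up by the k bases on either side of it, which is why a mismatch in the seed region prevents the read from being found.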

12 Step V. Filtering false junctions
Wang et al. (2008) observed that 86% of the minor isoforms were expressed at at least 15% of the level of the major isoform. For each junction, the average depth of read coverage is computed for the left and right flanking regions of the junction separately. The number of alignments crossing the junction is divided by the coverage of the more deeply covered side to obtain an estimate of the minor isoform frequency. 15% is the default cut-off.
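
The filter amounts to one division and one comparison. The sketch below is an illustration of the rule as stated on this slide; the function name is hypothetical, and treating a junction exactly at the cut-off as kept, and a junction with no flanking coverage as rejected, are assumptions.

```python
def keep_junction(junction_reads, left_coverage, right_coverage,
                  min_isoform_fraction=0.15):
    """Estimate minor isoform frequency and compare it to the cut-off.

    junction_reads: number of alignments crossing the junction.
    left_coverage, right_coverage: average read depth of each flank.
    """
    deeper_side = max(left_coverage, right_coverage)
    if deeper_side <= 0:
        return False  # assumption: no flanking coverage means no support
    return junction_reads / deeper_side >= min_isoform_fraction
```

For example, 3 junction reads next to flanks covered at depth 10 and 40 give 3/40 = 7.5%, below the default cut-off, so the junction would be discarded.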

13 Old Tophat’s pipeline Trapnell et al. 2009

14 Reads are becoming longer, and paired-end sequencing is becoming more common …

15 Current Tophat (latest 1.3.1)
Segment search
Butterfly search
Closure search
Coverage search
Gene model annotations

16 I. Segment search
--segment-length --segment-mismatches
--min-segment-intron --max-segment-intron

17 I. Segment search

18 II. Closure search
--closure-search --no-closure-search
--min-closure-intron --max-closure-intron
Closure search is only used when TopHat is run with paired-end reads.
Closure search should only be used when the expected inner distance between mates is small (<= 50bp).

19 III. Coverage search
--coverage-search: coverage search is disabled by default for reads 75bp or longer
--no-coverage-search
--min-coverage-intron --max-coverage-intron

20 IV. Butterfly search --butterfly-search
Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA, that fall within the introns of your transcripts.

21 V. Junction annotations
-G/--GTF <GTF 2 or GFF3>
-j/--raw-juncs <.juncs file>
--no-novel-junctions: only look for reads across junctions indicated in the supplied GFF or junctions file.

22 Input
tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
Reference sequence indexed with bowtie-build
Fastq sequences
Quality format? phred33 (default), --solexa-quals, --solexa1.3-quals
Paired-end? Strand-specific? Multiple files?
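
The quality-format question matters because the same FASTQ character means different scores under different encodings. A minimal sketch, assuming the usual offsets of 33 for phred33 and 64 for Illumina 1.3+ qualities (the case addressed by --solexa1.3-quals); the function name is hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores.

    offset=33 for phred33 input; offset=64 for Illumina 1.3+ style input.
    """
    return [ord(ch) - offset for ch in quality_string]
```

Decoding a Phred+64 file with an offset of 33 inflates every score by 31, which is why the matching option needs to be passed for older Illumina data.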

23 The software is optimized for reads 75bp or longer.
Mixing paired- and single-end reads together is not supported.

24 Strand-specific data --library-type TopHat will treat the reads as strand specific.

25 Paired-end data
-r/--mate-inner-dist <int>: the expected (mean) inner distance between mate pairs.
--mate-std-dev <int>: the standard deviation for the distribution of inner distances between mate pairs.
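
The inner distance is the fragment length minus the two read lengths, so both options can be estimated from a sample of fragment sizes. The sketch below is illustrative; the function names and the example fragment lengths are hypothetical, and fragment lengths are assumed to exclude adapters.

```python
import statistics


def mate_inner_distance(fragment_length, read_length):
    """Inner distance between mates: fragment minus both read lengths."""
    return fragment_length - 2 * read_length


def estimate_mate_options(fragment_lengths, read_length):
    """Return (mean, standard deviation) of inner distances, rounded,
    as candidate values for -r and --mate-std-dev."""
    inner = [mate_inner_distance(f, read_length) for f in fragment_lengths]
    return round(statistics.mean(inner)), round(statistics.stdev(inner))
```

For 300bp fragments sequenced with 50bp reads at each end, `mate_inner_distance(300, 50)` gives 200.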

26 Other parameters
--bowtie-n (after tophat 1.3.0)
-g/--max-multihits
-a/--min-anchor-length (>=3, default 8)
-m/--splice-mismatches (default 0)
-F/--min-isoform-fraction < >
-p/--num-threads
--keep-tmp

27 Output
accepted_hits.bam: a list of read alignments in BAM format (binary SAM).
junctions.bed, insertions.bed, deletions.bed

28 References
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009. doi: /bioinformatics/btp120
Tophat manual
Further reading: Tophat-fusion

29 Practice time All the files are at: ngs_vm1:
Reference sequence: REF.fa
RNA-seq data 1: SampleA.Run01, SampleA.Run02; paired-end, 50nt at each end, Phred33 quality
RNA-seq data 2: SampleB.Run01; 75nt at each end, strand-specific, solexa1.3-quals

30 Index the genome sequence
bowtie-build REF.fa REF

31 Run tophat
tophat --version # update is frequent; version is important
tophat # go through all the parameters
tophat \
  -o sampleA.ouput \
  -r \
  --mate-std-dev 30 \
  REF \
  SampleA.Run01.1.fastq,SampleA.Run02.1.fastq \
  SampleA.Run01.2.fastq,SampleA.Run02.2.fastq

32 Run tophat
tophat \
  -o sampleB.ouput \
  -r 50 \
  --mate-std-dev 30 \
  --library-type fr-firststrand \
  --solexa1.3-quals \
  REF \
  SampleB.Run01.1.fastq \
  SampleB.Run01.2.fastq

