From Smith-Waterman to BLAST

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
BLAST Sequence alignment, E-value & Extreme value distribution.
A new method of finding similarity regions in DNA sequences Laurent Noé Gregory Kucherov LORIA/UHP Nancy, France LORIA/INRIA Nancy, France Corresponding.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Sequence similarity (II). Schedule Mar 23midterm assignedalignment Mar 30midterm dueprot struct/drugs April 6teams assignedprot struct/drugs April 13RNA.
Bioinformatics and Phylogenetic Analysis
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Pairwise Sequence Alignment Part 2. Outline Global alignments-continuation Local versus Global BLAST algorithms Evaluating significance of alignments.
Similar Sequence Similar Function Charles Yan Spring 2006.
Heuristic Approaches for Sequence Alignments
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 16 th, 2014.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Protein Sequence Comparison Patrice Koehl
Sequence alignment, E-value & Extreme value distribution
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Sequence comparison: Local alignment
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
SISAP’08 – Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding Ahmet Sacan and I. Hakki Toroslu
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, All rights reserved.
Indexing DNA sequences for local similarity search Joint work of Angela, Dr. Mamoulis and Dr. Yiu 17/5/2007.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Chapter 3 Computational Molecular Biology Michael Smith
PatternHunter II: Highly Sensitive and Fast Homology Search Bioinformatics and Computational Molecular Biology (Fall 2005): Representation R 林語君.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
PatternHunter: A Fast and Highly Sensitive Homology Search Method Bin Ma Department of Computer Science University of Western Ontario.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2015.
Lecture 15 Algorithm Analysis
Construction of Substitution matrices
Doug Raiford Phage class: introduction to sequence databases.
Bioinformatics Computing 1 CMP 807 – Day 2 Kevin Galens.
Step 3: Tools Database Searching
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
Heuristic Alignment Algorithms Hongchao Li Jan
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Homology Search Tools Kun-Mao Chao (趙坤茂)
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Sequence comparison: Local alignment
Homology Search Tools Kun-Mao Chao (趙坤茂)
Fast Sequence Alignments
BIOINFORMATICS Fast Alignment
Basic Local Alignment Search Tool
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

From Smith-Waterman to BLAST Jeremy Buhler (in absentia) Wilson Leung 07/2015

Key limitations of the Smith-Waterman local alignment algorithm Quadratic in time and space complexity Report only one optimal alignment Usually want all interesting alignments Example: map a mRNA against a genome Query Genome Query sequence Paralogous features in genome

Smith-Waterman implementations Bill Pearson’s ssearch http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml Water (EBI / EMBOSS) http://www.ebi.ac.uk/Tools/psa/emboss_water/ DeCypherSW from TimeLogic http://www.timelogic.com/catalog/775 DeCypherSW uses J-series FPGA cards

BLAST alignment strategy: generate and filter Query Database Initial candidates Refined candidates Filtering Final Alignments Smith-Waterman Other sequence alignment tools use a similar approach Goal: minimize the need for calculating Smith-Waterman alignments

Challenges with the BLAST alignment strategy Identify candidate patterns Find the best alignment “near” a candidate

Identify candidate patterns High-scoring alignment between two sequences will contain some consecutive matches Treat k-mer (word) matches as candidates …atacatcactacgatcc-a… …agacatg--tgcaatccca… acat atcc (k = 4)

Locate k-mer matches (k=3) 1 2 3 4 5 6 7 a c g t a c g Query: 3-mers in query 3-mer matches in database a c g c g t a 1 2 3 4 5 6 7 3-mer Position(s) acg 1 , 5 cgt 2 BLAT index the database instead of the query k-mer match Database position(s) Query position(s) gta 3 acg 1 1, 5 tac 4 cgt 4 2 gta 5 3

Use a hash table to more efficiently store k-mers A table of 4k entries is required to store all possible k-mers of a DNA query sequence BLAST uses a hash table to store k-mers Space requirement proportional to the query size Reduces the time required to the sum of the lengths of the two sequences SW time complexity is proportional to the product of the two sequence lengths

Other “Build a Table” abstractions Search multiple queries against a database BLAT: index the database More space-efficient index structures Suffix array Burrows-Wheeler transform FM-index Used by second-generation sequence aligners (e.g., BWA, Bowtie) Li H and Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics. 2010 Sep;11(5):473-83.

k-mer size affects the sensitivity and specificity of the search How “good” are the candidate matches? Trade off between sensitivity (true positives) and specificity (true negatives) k = 1 (high sensitivity) k = entire sequence (high specificity)

Quantifying specificity Given DNA sequences S and T i.i.d. random with equal base frequencies Probability of 1 bp match: Probability of k-mer match: i.i.d. = independent and identically distributed Number of matches grow exponentially as k increases (these matches are “junk”) Log410(3+8) = 18bp Unique primers = 18-20bp Expected number of k-mer matches: Search 1kb pattern against a 1Gb database:

Quantifying sensitivity Require at least one k-mer match to detect an alignment between S and T Sequences with lower percent identity have fewer k-mer matches How large a value of k is likely to detect most alignments?

Word length versus probability of occurrence Target length (L) = 100 1.0 k=11 80% identity 0.9 67% identity 0.8 0.7 0.6 Probability of occurrence (L=100) 0.5 0.4 In practice, 0.3 0.2 0.1 6 7 8 9 10 11 12 13 14 15 16 Word length (k)

Adjust k-mer size based on the level of sequence similarity BLAT (k=15) Find highly similar sequences blastn (k=11) Find most medium to high similarity alignments Most candidates are false positives RepeatMasker (k=8) Find highly diverged repeat copies Megablast: word size = 28

Use more sensitive parameters to identify the initial transcribed exon Program Selection: From megablast to blastn Word Size: From 11 to 7 Match/Mismatch Scores: From +2/-3 to +1/-1 Gap Costs: Existence: from 5 to 2 Extension: from 2 to 1 Can use more sensitive parameters because we are searching a small region

Word match for protein sequences Use shorter k-mer: blastp (k=3) Allow approximate matches using similarity: Keep all word matches with score ≥ T (neighborhood) Reduce number of spurious candidates: Require two word matches along the same diagonal (two-hit algorithm)

Use dynamic programming (DP) to filter candidates Search the region surrounding each candidate Define the size and shape of the search region Report multiple high-scoring alignments Align a multi-exon mRNA against a genome Report alignments to all exons Problem with throwing things out (shadowing)

The “shadowing problem” A “good” alignment might be omitted because of a better alignment within the search region Genome Query Tandem duplication Missing exon Multi-exon gene Genome Aʹ Aʹ A No similarity A Lower-scoring A is shadowed by higher-scoring Aʹ

Solution: pin the alignment Candidate match is centered on S[i], T[j] Compute optimal alignments that pass through (i, j) Half-anchor alignments Ab Ab = Best alignment that ends at (i, j) A A = Best alignment (combine Af and Ab) (i, j) Sequence T Af Af = Best alignment that starts from (i, j) BLAST does not use this strategy but other BLAST-like tools use this strategy Address shadowing: omit nearby alignments that do not pass through the candidate Calculate Ab by running DP backward Sequence S

Define the size of the two search regions One option: bound the search regions by the ends of the two sequences Best case: half of the entire DP matrix Worst case: cost as much as not filtering (i, j) DP fill region ≥ half of matrix

The “chaining problem” Opposite problem to shadowing Connect multiple features into a single alignment Sequence 2 +40 Junk! Sequence 1 Feature A Feature B Problem with not limiting the size of the search region Sequence 1 -39 Sequence 2 +40

BLAST often chains multiple alignment blocks into a single alignment tblastn of CaMKII-PA (query) against the D. mojavensis genome (subject) Mills LJ and Pearson WR. Adjusting scoring matrices to correct overextended alignments. Bioinformatics. 2013 Dec 1;29(23):3007-13.

Ignore alignments that are “not promising” Ignore alignments with very large gaps Usually have poor score Can identify second feature from its own candidates Limit search region to the diagonal surrounding the candidate The bandwidth (b) parameter controls the width of the diagonal Gaps usually more expensive than substitutions in most scoring systems

Use banded alignments to reduce the search space (i, j) 2b+1 Number of DP entries to compute is proportional to the length of the shorter sequence (times b)

Use X-drop to further reduce the search space Terminate the alignment if the score drops below x compared to the optimal score Af (i*, j*) (u, v) Mi*,j* = σ Mu,v < σ - x Implementation detail: fill the score below drop-off score with negative infinity If total score of Af is ≥ σ, the score of this piece must be > x Minimum score?

BLAST X-drop strategy σ X Cumulative score Length of extension Final alignment σ X Cumulative score Trim back to position with the highest score Length of extension Korf, I., Yandell, M., and Bedell, J. (2003). BLAST. O’Reilly Media, Inc.

Summary BLAST uses a generate and filter strategy Generate candidate matches Filter using dynamic programming (DP) Mitigates problems with shadowing and chaining Minimizes the amount of time spent on DP Banded alignment X-drop

Questions?

Some alignments with |(jʹ – iʹ) – (j – i)| > b b1 + b2 > b b1 Impossible alignments for bandwidth b b2 (iꞌ,jꞌ) Some alignments with |(jʹ – iʹ) – (j – i)| > b