Fast Sequence Alignments


Visualizing FASTA - dotplot Note: This is only a visualization, not a data structure.

FASTA – Actual data structure The actual data structure is a hash table. It stores the keys (words of length k) and their positions in the text.

FASTA – The algorithm The query sequence (pattern) is hashed into words of length ktup. This hash table is then used to localize exact matches of substrings of length ktup in the subject sequence (text).
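The hashing and lookup steps described above can be sketched roughly as follows. This is a minimal illustration, not FASTA's actual implementation; the function names `build_ktup_index` and `find_exact_matches` are hypothetical, and a Python dictionary stands in for the hash table.

```python
from collections import defaultdict


def build_ktup_index(query, ktup):
    """Map every word of length ktup in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    return index


def find_exact_matches(query, text, ktup):
    """Return (query_pos, text_pos) pairs where a ktup word matches exactly.

    Each pair corresponds to one "dot" in the dotplot.
    """
    index = build_ktup_index(query, ktup)
    hits = []
    for j in range(len(text) - ktup + 1):
        # .get avoids inserting empty entries into the defaultdict
        for i in index.get(text[j:j + ktup], ()):
            hits.append((i, j))
    return hits


hits = find_exact_matches("ACGTAC", "TACGT", 3)
```

Each hit lies on the diagonal `text_pos - query_pos`, which is what the later steps group by.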

FASTA – The algorithm The ten diagonals with the highest density of matches are rescored using a scoring matrix (in the case of protein sequences) or a simple DNA scoring scheme.
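As a rough sketch of the diagonal selection, the exact-match hits can be grouped by diagonal and ranked. The simplification here is an assumption: the raw hit count per diagonal stands in for "density of matches", and the rescoring with a scoring matrix is left out. `top_diagonals` is a hypothetical name.

```python
from collections import Counter


def top_diagonals(hits, n_best=10):
    """Rank diagonals by their number of word hits.

    hits: iterable of (query_pos, text_pos) pairs.
    A diagonal is identified by text_pos - query_pos.
    Returns up to n_best (diagonal, hit_count) pairs, best first.
    """
    counts = Counter(text_pos - query_pos for query_pos, text_pos in hits)
    return counts.most_common(n_best)


best = top_diagonals([(0, 1), (1, 2), (3, 0)], n_best=2)
```

In this toy input two hits share diagonal 1, so that diagonal is ranked first.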

FASTA – The algorithm The diagonal regions that, after rescoring, have a score above some threshold are joined into one region. The joined region is aligned by dynamic programming.

FASTA – Runtime analysis Step 2's runtime is proportional to the number of "dots" in the dotplot (exact matches). Define the probability of a single-character match as $P_{\text{match}}$. Is the probability constant? Then, the probability of matching a ktup is $P_{\text{match}}^{\text{ktup}}$.

FASTA – Runtime analysis Thus, step 2's runtime is essentially $O(m \cdot n \cdot P_{\text{match}}^{\text{ktup}})$.

FASTA – Runtime analysis But if $P_{\text{match}}^{\text{ktup}}$ is treated as a constant factor, $O(m \cdot n \cdot P_{\text{match}}^{\text{ktup}})$ is still essentially $O(m \cdot n)$. Is this still worthwhile?
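To get a feel for the size of the constant factor, the expected number of dots can be computed directly from the formula. The numbers below are assumptions chosen for illustration: random DNA with uniform base frequencies (so the single-character match probability is 1/4) and two sequences of length 1000. The function name is hypothetical.

```python
def expected_ktup_matches(m, n, p_match, ktup):
    """Expected number of exact ktup matches (dots) under a random model.

    Follows the slide's estimate m * n * p_match ** ktup and ignores
    edge effects at the sequence ends.
    """
    return m * n * p_match ** ktup


# Assumed toy setting: uniform random DNA, p_match = 1/4.
single_letters = expected_ktup_matches(1000, 1000, 0.25, 1)  # 250000 dots
six_mers = expected_ktup_matches(1000, 1000, 0.25, 6)        # about 244 dots
```

The asymptotic class is unchanged, but with ktup = 6 the work drops by a factor of 4096 relative to the full matrix, which is where the practical speed-up comes from.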

FASTA – A possible speed up Step 4 includes dynamic programming. Although it is applied only to relatively small regions, it may be wise to avoid it altogether. BLAST's original implementation does exactly that.

BLAST – Basic Local Alignment Search Tool BLAST was published two years after FASTA and at the time was an order of magnitude faster. Note that both algorithms have changed considerably since then, and both exist in multiple versions. Like FASTA, BLAST excludes unpromising areas of the alignment matrix.

BLAST – Maximal Segment Pair Maximal Segment Pairs (MSPs) are pairs of equal length substrings (of the query and text sequences) whose scores cannot be improved by extension. MSPs are the central notion of the BLAST algorithm.

BLAST – The algorithm (1) Compilation of a list of high-scoring words from the query sequence. (2) A scan of the text sequence(s) for matches of these words. (3) Extension of the matches.

BLAST – Stage 1 – query listing Similar to the ktup list generated in the FASTA algorithm, a list of words is created. Here however, for every substring 𝑠 of length 𝑤 in the query, “similar” words are stored as well. How do we define similar words?

BLAST – Stage 1 – query listing Similar words are words of the same length as 𝑠, which when aligned to 𝑠 (using a scoring matrix), score ≥𝑇, where 𝑇 is a chosen threshold. How would we store the words if we were dealing with proteins? What about DNA?
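A brute-force sketch of the word list may help make the definition concrete. Everything specific here is an assumption for illustration: a toy scoring scheme (match +2, mismatch -1) replaces a real scoring matrix, all candidate words are enumerated exhaustively, and the function names are hypothetical.

```python
from itertools import product


def score(a, b, match=2, mismatch=-1):
    """Toy substitution score; a real search would look this up in a matrix."""
    return match if a == b else mismatch


def neighborhood(word, alphabet, threshold):
    """All words of the same length that score >= threshold against word."""
    neighbors = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            neighbors.append("".join(candidate))
    return neighbors


def build_word_list(query, w, alphabet, threshold):
    """Map each similar word to the query positions it was derived from."""
    words = {}
    for i in range(len(query) - w + 1):
        for neighbor in neighborhood(query[i:i + w], alphabet, threshold):
            words.setdefault(neighbor, []).append(i)
    return words


similar = neighborhood("ACG", "ACGT", threshold=3)
```

With these toy scores, a threshold of 3 admits the word itself plus every word with exactly one mismatch, and a threshold of 6 admits only exact matches. Exhaustive enumeration grows as the alphabet size to the power of the word length, so this is only practical for short words.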

BLAST – Stage 2 – scanning the text Now the text sequence(s) is scanned for the words on the list from stage 1. Which approaches do you know that can accomplish this task, and in how much time?

BLAST – Stage 2 – scanning the text The BLAST authors used a very memory-efficient method for scanning the text, which is beyond the scope of this course. With it they achieved a runtime of 𝑂(𝑛+𝑙+𝑘), where 𝑛 is the size of the database, 𝑙 is the length of the word list, and 𝑘 is the number of matches to words on the list. Since 𝑛 usually dominates, this is essentially 𝑂(𝑛).
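One simple way to perform the scan is a dictionary lookup at every text position. This is a stand-in chosen for clarity, not the memory-efficient method used by the BLAST authors; `scan_text` and the shape of `word_positions` are assumptions.

```python
def scan_text(text, word_positions, w):
    """Find all occurrences of listed words in the text.

    word_positions: dict mapping each word of length w to the query
    positions it came from. Returns (query_pos, text_pos) hit pairs.
    """
    hits = []
    for j in range(len(text) - w + 1):
        for i in word_positions.get(text[j:j + w], ()):
            hits.append((i, j))
    return hits


hits = scan_text("TTACGTT", {"ACG": [0], "CGT": [1]}, 3)
```

With average constant-time dictionary lookups, the loop visits each text position once and does extra work only for actual hits, which matches the shape of the bound quoted above.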

BLAST – Stage 3 – extension The high-scoring segment pairs (HSPs) found in the previous stage are now extended in both directions. Extension stops when the extended HSP's score drops under a chosen threshold (not the same threshold 𝑇 used to build the word list). The results are MSPs.
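The extension step can be sketched as an ungapped walk along the hit's diagonal. The stopping rule used here is an assumption about what "drops under a chosen threshold" means: extension in a direction stops once the running score falls more than `x_drop` below the best score seen so far. The toy scores (match +2, mismatch -1) and function names are hypothetical.

```python
def _extend(query, text, qi, tj, step, x_drop, match, mismatch):
    """Walk from (qi, tj) in direction step (+1 or -1).

    Returns (best score gain, number of positions kept).
    """
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= qi < len(query) and 0 <= tj < len(text):
        gain += match if query[qi] == text[tj] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break  # fell too far below the best score; give up
        qi += step
        tj += step
    return best_gain, best_length


def extend_hit(query, text, i, j, w, x_drop=3, match=2, mismatch=-1):
    """Extend a word hit query[i:i+w] / text[j:j+w] without gaps.

    Returns (score, query_start, query_end), with query_end exclusive.
    """
    seed = sum(match if query[i + k] == text[j + k] else mismatch
               for k in range(w))
    right_gain, right_len = _extend(query, text, i + w, j + w, +1,
                                    x_drop, match, mismatch)
    left_gain, left_len = _extend(query, text, i - 1, j - 1, -1,
                                  x_drop, match, mismatch)
    return seed + left_gain + right_gain, i - left_len, i + w + right_len


result = extend_hit("GGACGTACTT", "CCACGTACAA", 3, 3, 3)
```

Because only the best-scoring prefix of each walk is kept, trailing mismatches are trimmed and the returned segment cannot be improved by further extension within the explored range.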

BLAST – Stage 3 – extension Runtime of this stage is proportional to the number of HSPs found in stage 2. Under a random sequence model, the number of HSPs is proportional to $P_{\text{match}}^{w}$, where $w$ is the length of the words in the list. So the total runtime of this stage is $O(n \cdot l \cdot P_{\text{match}}^{w})$.

BLAST – Some further runtime discussion Let us introduce two definitions to help us analyse BLAST (and many other similar algorithms): Sensitivity – The proportion of correct results returned (in our case, homologous) among all correct results in the DB. Specificity – The proportion of correct results returned among all of the returned results.
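The two definitions translate directly into set arithmetic. The sketch below follows the definitions exactly as given above (note that the second quantity is often called precision in other fields); the function names and the toy result sets are assumptions.

```python
def sensitivity(returned, correct):
    """Fraction of all correct results in the DB that were returned."""
    returned, correct = set(returned), set(correct)
    return len(returned & correct) / len(correct)


def specificity(returned, correct):
    """Fraction of the returned results that are correct."""
    returned, correct = set(returned), set(correct)
    return len(returned & correct) / len(returned)


# Toy example: 4 sequences returned, 3 true homologs in the database,
# 2 of which were found.
returned = {"seqA", "seqB", "seqC", "seqD"}
correct = {"seqA", "seqB", "seqE"}
sens = sensitivity(returned, correct)  # 2 of 3 homologs found
spec = specificity(returned, correct)  # 2 of 4 returned hits are homologs
```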

BLAST – Some further runtime discussion In BLAST, a larger word list increases sensitivity. However, it also lengthens the runtime and may decrease specificity. Different values of 𝑤 and 𝑇 change the balance between sensitivity and specificity.