Fast Sequence Alignments

Visualizing FASTA - dotplot
Note: This is only a visualization, not a data structure.

FASTA – Actual data structure
The actual data structure is a hash table. It stores the keys (words of length k) and their positions in the text.
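As an illustration of such a table, here is a minimal Python sketch. The function name `build_kmer_index` and the example sequence are assumptions made for this example, not part of the FASTA software.

```python
from collections import defaultdict

def build_kmer_index(sequence, k):
    """Map every word of length k in sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return dict(index)

index = build_kmer_index("ACGTACGA", 3)
print(index["ACG"])  # [0, 4]
```

Each lookup of a word takes expected constant time (for fixed k), which is what makes the later matching step fast.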

FASTA – The algorithm
The query sequence (pattern) is hashed into words of length ktup. This hash table is then used to locate exact matches of substrings of length ktup in the subject sequence (text).
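The lookup described here can be sketched as follows. This is a simplified illustration, not the FASTA implementation: the names `find_ktup_hits` and `hits_per_diagonal` are hypothetical, and a diagonal is identified here by the difference text position minus query position.

```python
from collections import Counter, defaultdict

def find_ktup_hits(query, text, ktup):
    """Return (query_pos, text_pos) for every exact ktup match between query and text."""
    # Hash the query: word -> start positions in the query.
    query_index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        query_index[query[i:i + ktup]].append(i)
    # Slide over the text and look up each word of length ktup.
    hits = []
    for j in range(len(text) - ktup + 1):
        for i in query_index.get(text[j:j + ktup], []):
            hits.append((i, j))
    return hits

def hits_per_diagonal(hits):
    """Count hits on each diagonal, identified by text_pos - query_pos."""
    return Counter(j - i for i, j in hits)

hits = find_ktup_hits("ACGT", "TTACGTAA", 2)
print(hits)                     # [(0, 2), (1, 3), (2, 4)]
print(hits_per_diagonal(hits))  # Counter({2: 3})
```

Every hit is one "dot" of the dotplot, and hits that share a diagonal line up into a candidate ungapped match.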

FASTA – The algorithm
The ten diagonals with the highest density of matches are rescored using a scoring matrix (in the case of protein sequences) or a simple DNA scoring scheme.
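A possible way to rescore one diagonal is sketched below. The scores `match=5` and `mismatch=-4` are assumed values for a simple DNA scheme, and taking the best-scoring ungapped segment on the diagonal is an assumption about what "rescoring" computes; for proteins the per-position score would come from a scoring matrix instead.

```python
def rescore_diagonal(query, text, diagonal, match=5, mismatch=-4):
    """Best-scoring ungapped segment on one diagonal (diagonal = text_pos - query_pos)."""
    best = current = 0
    for i in range(len(query)):
        j = i + diagonal
        if j < 0 or j >= len(text):
            continue  # this query position has no partner on the diagonal
        current += match if query[i] == text[j] else mismatch
        current = max(current, 0)  # restart the segment once the running score goes negative
        best = max(best, current)
    return best

print(rescore_diagonal("ACGT", "TTACGTAA", 2))  # 20
```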

FASTA – The algorithm
The diagonal regions that, after rescoring, have a score above some threshold are joined into one region. The joined region is aligned by dynamic programming.
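The dynamic programming step can be illustrated with a plain local alignment (Smith-Waterman) score. This sketch fills the whole matrix for two short strings that stand in for the joined region; the scoring values and the linear gap penalty are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: start fresh, align the two letters, or open a gap in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(local_alignment_score("ACGTT", "ACTT"))  # 14
```

The cost is proportional to the product of the two lengths, which is why the step is applied only to the joined region rather than to the full sequences.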

FASTA – Runtime analysis
Step 2’s runtime is proportional to the number of “dots” in the dotplot (exact matches). Define the probability of a match between two letters as $P_{match}$. Is the probability constant? Then, the probability of matching a ktup is $P_{match}^{ktup}$.

Thus, step 2’s runtime is essentially $O(m \cdot n \cdot P_{match}^{ktup})$, where $m$ and $n$ are the lengths of the two sequences.

But $P_{match}^{ktup}$ is a constant, so $O(m \cdot n \cdot P_{match}^{ktup})$ is essentially $O(m \cdot n)$. Is this still worthwhile?
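A worked example may help with the question. The numbers below are assumptions chosen for illustration: uniformly random DNA, so that the per-letter match probability is 1/4, a ktup of 6, and sequence lengths of 1,000 and 1,000,000.

```latex
\[
P_{match} = \tfrac{1}{4}, \quad ktup = 6
\;\Rightarrow\;
P_{match}^{ktup} = 4^{-6} = \tfrac{1}{4096} \approx 2.4 \times 10^{-4}
\]
\[
m = 10^{3}, \quad n = 10^{6}
\;\Rightarrow\;
m \cdot n \cdot P_{match}^{ktup} = \frac{10^{9}}{4096} \approx 2.4 \times 10^{5}
\]
```

Under these assumptions the expected number of dots is about four thousand times smaller than the number of cells in the full alignment matrix, even though the asymptotic class is unchanged.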

FASTA – A possible speed up
Step 4 includes dynamic programming. Although it is applied only to relatively small areas, it may be wise to avoid it altogether. BLAST’s original implementation does exactly that.

BLAST – Basic Local Alignment Search Tool
BLAST was published two years after FASTA and at the time was an order of magnitude faster. Do note that both algorithms have changed quite a bit since, and both have multiple versions. Similar to FASTA, BLAST also excludes unpromising areas of the alignment matrix.

BLAST – Maximal Segment Pair
Maximal Segment Pairs (MSPs) are pairs of equal length substrings (of the query and text sequences) whose scores cannot be improved by extension. MSPs are the central notion of the BLAST algorithm.

BLAST – The algorithm
1. Compilation of a list of high scoring words from the query sequence.
2. A scan of the text sequence(s) for matches of these words.
3. Extension of the matches.

BLAST – Stage 1 – query listing
Similar to the ktup list generated in the FASTA algorithm, a list of words is created. Here however, for every substring 𝑠 of length 𝑤 in the query, “similar” words are stored as well. How do we define similar words?

Similar words are words of the same length as 𝑠, which when aligned to 𝑠 (using a scoring matrix), score ≥𝑇, where 𝑇 is a chosen threshold. How would we store the words if we were dealing with proteins? What about DNA?
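The definition of similar words can be made concrete with a small sketch for DNA. It enumerates every word of length `w` over the alphabet and keeps those that score at least the threshold against a query word. The match and mismatch scores are assumed toy values; for proteins the per-letter scores would come from a scoring matrix. The function names are hypothetical.

```python
from itertools import product

def score_words(a, b, match=5, mismatch=-4):
    """Score two equal-length words position by position, without gaps."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(query, w, threshold, alphabet="ACGT"):
    """Map each similar word to the query positions whose word it resembles."""
    words = {}
    for pos in range(len(query) - w + 1):
        s = query[pos:pos + w]
        for candidate in map("".join, product(alphabet, repeat=w)):
            if score_words(s, candidate) >= threshold:
                words.setdefault(candidate, []).append(pos)
    return words

# With threshold 6, the exact word (score 15) and all single-mismatch words (score 6) qualify.
print(sorted(neighborhood("ACG", 3, 6)))
```

Brute-force enumeration costs the alphabet size to the power `w` per query word, so it is only practical for short words; it is meant to show the definition, not an efficient construction.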

BLAST – Stage 2 – scanning the text
Now the text sequence(s) are scanned for the words on the list from stage 1. Which methods do you know that can achieve this task? How much time do they take?

BLAST – Stage 2 – scanning the text
BLAST researchers used a very memory efficient way of scanning the text, which is beyond the scope of this course. However, with their system they achieved a runtime of 𝑂(𝑛+𝑙+𝑘) where 𝑛 is the size of the database, 𝑙 is the length of the word list, and 𝑘 the number of matches on the word list. Basically 𝑂(𝑛).
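For comparison, a simple scan can be sketched with an ordinary dictionary lookup. This is not the memory-efficient method used by the BLAST authors, only a stand-in with a similar, roughly linear running time for a fixed word length; `scan_text` and the shape of `word_list` (word mapped to query positions) are assumptions of this example.

```python
def scan_text(text, word_list, w):
    """Return (query_pos, text_pos) seed hits by looking up each length-w window of text."""
    hits = []
    for j in range(len(text) - w + 1):
        for i in word_list.get(text[j:j + w], ()):
            hits.append((i, j))
    return hits

word_list = {"ACG": [0], "ACT": [0]}  # hypothetical output of stage 1
print(scan_text("TTACTACG", word_list, 3))  # [(0, 2), (0, 5)]
```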

BLAST – Stage 3 – extension
The high-scoring segment pairs (HSPs) found in stage 2 are now extended in both directions. Extension stops when the extended HSP’s score drops below a chosen threshold (not the same threshold as 𝑇 from stage 1). The results are MSPs.
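An ungapped extension can be sketched as follows. The stopping rule used here is an assumption about what the threshold means: extension in one direction stops once the running score falls more than `drop_off` below the best score seen so far. The scoring values and all names are hypothetical.

```python
def _score(x, y, match=5, mismatch=-4):
    return match if x == y else mismatch

def extend_hit(query, text, i, j, w, drop_off):
    """Extend the seed query[i:i+w] / text[j:j+w] without gaps in both directions.

    Returns (score, query_start, text_start, length) of the best segment pair found.
    """
    best = running = sum(_score(query[i + t], text[j + t]) for t in range(w))
    # Extend to the right of the seed.
    right, t = 0, 0
    while i + w + t < len(query) and j + w + t < len(text):
        running += _score(query[i + w + t], text[j + w + t])
        t += 1
        if running > best:
            best, right = running, t
        elif best - running > drop_off:
            break
    # Extend to the left, starting from the best right-hand extension.
    running, left, t = best, 0, 0
    while i - 1 - t >= 0 and j - 1 - t >= 0:
        running += _score(query[i - 1 - t], text[j - 1 - t])
        t += 1
        if running > best:
            best, left = running, t
        elif best - running > drop_off:
            break
    return best, i - left, j - left, w + left + right

print(extend_hit("TTACGTAA", "GGACGTCC", 3, 3, 2, drop_off=8))  # (20, 2, 2, 4)
```

In the usage line the seed "CG" grows into the segment pair "ACGT" / "ACGT"; further extension only lowers the score, so the reported pair cannot be improved by extending it.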

BLAST – Stage 3 – extension
Runtime of this stage is proportional to the number of HSPs found in stage 2. Under a random sequence model the number of HSPs is proportional to $P_{match}^{w}$, where $w$ is the length of the words in the list. So the total runtime of this stage is $O(n \cdot l \cdot P_{match}^{w})$.

BLAST – Some further runtime discussion
Let us introduce two definitions to help us analyse BLAST (and many other similar algorithms):
Sensitivity – The proportion of correct results returned (in our case, homologous sequences) among all correct results in the DB.
Specificity – The proportion of correct results returned among all of the returned results.
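The two definitions can be written directly as set computations. The sequence identifiers below are made up for the example.

```python
def sensitivity(returned, correct):
    """Correct results returned, as a fraction of all correct results in the DB."""
    return len(returned & correct) / len(correct)

def specificity(returned, correct):
    """Correct results returned, as a fraction of all returned results."""
    return len(returned & correct) / len(returned)

returned = {"seq1", "seq2", "seq3", "seq4", "seq5"}  # hypothetical search output
correct = {"seq1", "seq2", "seq8", "seq9"}           # hypothetical true homologs
print(sensitivity(returned, correct))  # 0.5
print(specificity(returned, correct))  # 0.4
```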

BLAST – Some further runtime discussion
In BLAST, a larger word list will increase sensitivity. However, this will lengthen the algorithm’s runtime and might decrease specificity. Different values of 𝑤 and 𝑇 will change the balance between sensitivity and specificity.
