SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size
Eldar Giladi, Michael G. Walker, James Z. Wang, Wayne Volkmuth
Bioinformatics, Vol. 18, no. 6, 2002, pages 873-879
Norman Casagrande 2003 – IFT 6291

Outline
- Motivation
- Previous Related Research
- The SST Algorithm
- Computational Results
- Discussion

Motivation
Searches for near-exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics. With current algorithms, the time and cost of these searches are prohibitive, so faster algorithms are desired.

Previous related research: Needleman-Wunsch and Smith-Waterman
These algorithms perform global (Needleman-Wunsch) and local (Smith-Waterman) sequence alignment using dynamic programming.
Time complexity: O(mn), where m = length of the query sequence and n = sum of the lengths of all sequences in the database.
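To make the O(mn) cost concrete, here is a minimal sketch of the Smith-Waterman local alignment score. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not taken from the slides; the point is only that every one of the m x n table cells is filled once.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (scores are assumed values)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):     # m rows ...
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1): # ... times n columns
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
print(identical, unrelated)
```

When n is the total length of a large database, this nested loop is the cost that index-based methods try to avoid.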

Previous related research: FASTA
This algorithm identifies regions of local sequence similarity by first identifying candidate similar sequences based on shared k-tuples, and then performing local alignment on those candidates with the Smith-Waterman algorithm.
Time complexity: O(mn).

Previous related research: BLAST
This algorithm identifies regions of local sequence similarity by first identifying candidate similar sequences that have k-tuples in common with the query sequence, and then extending the regions of similarity.
Time complexity: O(n).

The SST Algorithm
- Database partitioning with sliding windows
- Mapping windows into vector space
- Tree-structured index for database windows
- The search procedure

SST: Database Partitioning
First step:
- The database is partitioned into overlapping windows.
- Windows have a fixed length W, typically 25-1000.
- The overlap parameter ∆ is typically 5 ≤ ∆ ≤ W/2.

SST: Database Partitioning
Example: with W = 6 and ∆ = 2, the sequence AACCGGTTACGTACG... is partitioned into the windows AACCGG, CCGGTT, GGTTAC, ...
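The partitioning step can be sketched as follows. This assumes that ∆ is the offset between consecutive window starts, which is what the W = 6, ∆ = 2 example suggests; the function name and the choice to drop a trailing fragment shorter than W are illustrative assumptions.

```python
def database_windows(sequence, w, delta):
    """Return (offset, window) pairs of length w whose starts are delta apart."""
    if w <= 0 or delta <= 0:
        raise ValueError("w and delta must be positive")
    # Assumption: a trailing fragment shorter than w is not emitted.
    return [(start, sequence[start:start + w])
            for start in range(0, len(sequence) - w + 1, delta)]


windows = database_windows("AACCGGTTACGTACG", w=6, delta=2)
for offset, window in windows:
    print(offset, window)
```

Keeping the offset alongside each window lets a later match be mapped back to its position in the database sequence.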

SST: Query Partitioning
The query sequence is partitioned into either:
- non-overlapping windows, or
- windows which overlap by half of their length.
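Both query partitioning modes differ only in the step between window starts. A minimal sketch, with a hypothetical function name and the assumption that a trailing fragment shorter than W is ignored:

```python
def query_windows(query, w, half_overlap=False):
    """Split a query into windows of length w.

    half_overlap=False: consecutive windows do not overlap (step = w).
    half_overlap=True:  consecutive windows share half their length (step = w // 2).
    """
    if w <= 0:
        raise ValueError("w must be positive")
    step = max(1, w // 2) if half_overlap else w
    # Assumption: a trailing fragment shorter than w is dropped.
    return [query[i:i + w] for i in range(0, len(query) - w + 1, step)]


plain = query_windows("AACCGGTTACGT", 6)
overlapped = query_windows("AACCGGTTACGT", 6, half_overlap=True)
print(plain)
print(overlapped)
```

The half-overlap mode produces roughly twice as many windows, so it roughly doubles the number of index lookups per query.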

SST: Mapping Windows into Vector Space
For each window, create a vector which counts the number of occurrences of each k-tuple.
Tuple size k: 2-10, typically 4 or 5 (found empirically).

SST: Mapping Windows into Vector Space
Example: assume the window AACCGG and k = 2. There are 4^2 = 16 possible 2-tuples:
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
Resulting vector of counts, in the same order:
1100011000100000
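The mapping from a window to its k-tuple count vector can be sketched as below. The function name, the lexicographic ordering of tuples over the alphabet ACGT, and the decision to skip tuples containing other characters are assumptions made for illustration.

```python
from itertools import product


def ktuple_vector(window, k=2, alphabet="ACGT"):
    """Count occurrences of every k-tuple over the alphabet, in lexicographic order."""
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)          # 4**k components for DNA
    for i in range(len(window) - k + 1):
        slot = index.get(window[i:i + k])
        if slot is not None:           # assumption: skip tuples with e.g. 'N'
            vector[slot] += 1
    return vector


vec = ktuple_vector("AACCGG", k=2)
print("".join(str(c) for c in vec))
```

For the window AACCGG the overlapping 2-tuples are AA, AC, CC, CG and GG, so exactly those five components are set to 1.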

SST: Creation of Tree-structured Index
The distance between vectors is used as a heuristic for the distance between sequences.
Example: A = (0, 1), B = (1, 0)
$d = \sum_i |A_i - B_i| = |0 - 1| + |1 - 0| = 2$
Method: TSVQ, tree-structured vector quantization (Gersho and Gray, 1992)
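The distance used in the example is the sum of absolute component differences (the L1 distance). A minimal sketch, with a hypothetical function name:

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


d = l1_distance([0, 1], [1, 0])
print(d)
```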

SST: Creation of Tree-structured Index
Select two centroids $X_A$ and $X_B$ and their corresponding partition of the data into disjoint sets A and B using the following iterative procedure:
- Choose two initial values for $X_A$ and $X_B$.
- For each vector y in the database, compute the distance d from the vector to each of the centroids. Assign y to set A if $d_A(y) < d_B(y)$, and to set B otherwise.
$d_A(y) = \sum_i |X_{A,i} - y_i|$, $d_B(y) = \sum_i |X_{B,i} - y_i|$

SST: Creation of Tree-structured Index
- Compute the new centroids:
$X_A = \frac{1}{|A|} \sum_{y \in A} y$, $X_B = \frac{1}{|B|} \sum_{y \in B} y$
where |A| = size of set A, |B| = size of set B.

SST: Creation of Tree-structured Index
- Compute values for the terminating criteria:
$D_A = \sum_{y \in A} d_A(y)$, $D_B = \sum_{y \in B} d_B(y)$
- Repeat until the change in $D_A$ and $D_B$ is less than a small threshold, or no vectors change partition.
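The iterative two-centroid procedure can be sketched as follows. The slides do not say how the initial centroids are chosen, so this sketch takes them as arguments; the function names, the tolerance `tol`, and the iteration cap `max_iter` are assumptions.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def split_in_two(vectors, x_a, x_b, tol=1e-9, max_iter=100):
    """Return (x_a, x_b, set_a, set_b) after iterating assign/update steps."""
    set_a, set_b, prev_d = [], [], None
    for _ in range(max_iter):
        # Assign y to A if d_A(y) < d_B(y), and to B otherwise.
        new_a = [y for y in vectors if l1(x_a, y) < l1(x_b, y)]
        new_b = [y for y in vectors if l1(x_a, y) >= l1(x_b, y)]
        if not new_a or not new_b:       # degenerate split: stop early
            return x_a, x_b, new_a, new_b
        unchanged = (new_a, new_b) == (set_a, set_b)
        set_a, set_b = new_a, new_b
        x_a, x_b = mean(set_a), mean(set_b)          # new centroids
        d = (sum(l1(x_a, y) for y in set_a),         # D_A
             sum(l1(x_b, y) for y in set_b))         # D_B
        small_change = (prev_d is not None
                        and abs(d[0] - prev_d[0]) < tol
                        and abs(d[1] - prev_d[1]) < tol)
        if unchanged or small_change:
            break
        prev_d = d
    return x_a, x_b, set_a, set_b


vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
x_a, x_b, set_a, set_b = split_in_two(vectors, [0, 0], [10, 10])
print(x_a, x_b)
```

This is essentially a two-cluster k-means step under the L1 distance, with centroids updated as component-wise means as on the slides.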

SST: Creation of Tree-structured Index
Recursively partition the sets A and B generated above using the same algorithm. The recursion terminates when the number of vectors in a set is smaller than a specified tolerance, or when the algorithm fails to fragment a cluster into two substantial new clusters.

SST: Creation of Tree-structured Index
TSVQ:
- Each leaf contains the set of vectors that are nearest neighbors to the centroid for that node.
- When the tree is balanced, the depth of the tree is proportional to the logarithm of the number of windows, and the number of windows is proportional to the size of the database.
- Average complexity of tree construction: O(n log n).
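The recursive construction might look like the sketch below. Several details are assumptions rather than the authors' choices: the dictionary node layout, the initialisation of the two centroids (first vector and the vector farthest from it), the `min_size` tolerance, and treating an empty side as the only "failed to fragment" case.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def two_means(vectors, max_iter=50):
    """Split vectors into two sets by iterated nearest-centroid assignment."""
    x_a = vectors[0]                                   # assumed initialisation
    x_b = max(vectors, key=lambda y: l1(x_a, y))
    set_a, set_b = [], []
    for _ in range(max_iter):
        new_a = [y for y in vectors if l1(x_a, y) < l1(x_b, y)]
        new_b = [y for y in vectors if l1(x_a, y) >= l1(x_b, y)]
        unchanged = (new_a, new_b) == (set_a, set_b)
        set_a, set_b = new_a, new_b
        if unchanged or not set_a or not set_b:
            break
        x_a, x_b = mean(set_a), mean(set_b)
    return set_a, set_b


def build_tree(vectors, min_size=2):
    """Each node stores its centroid; leaves also store their vectors."""
    node = {"centroid": mean(vectors), "children": None, "vectors": None}
    if len(vectors) > min_size:
        set_a, set_b = two_means(vectors)
        if set_a and set_b:                            # split succeeded
            node["children"] = (build_tree(set_a, min_size),
                                build_tree(set_b, min_size))
            return node
    node["vectors"] = vectors                          # terminal node
    return node


tree = build_tree([[0, 0], [0, 1], [10, 10], [10, 11]])
print(tree["centroid"], [c["vectors"] for c in tree["children"]])
```

Because both children of a successful split are non-empty, each child is strictly smaller than its parent and the recursion always terminates.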

SST: The Search Procedure
Begin at the root of the tree. Nodes are represented by their respective centroids. Select the branch whose centroid is closer to the query vector. Proceed recursively until reaching a terminal node. The vectors in the terminal node represent the database windows which are the nearest neighbors to the query window.

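SST: The Search Procedure
Example: query window AGCCTG (same length as the database windows).
Vector (k = 2): 0010010101000010
At a node with branches A and B, compute the distances $d_A$ and $d_B$ from the query vector to the two centroids. If $d_A > d_B$, follow branch B; otherwise follow branch A.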
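The descent can be sketched on a small hand-built tree. The node layout (dictionaries with `centroid`, `children`, and `windows` keys) and the toy two-component centroids are hypothetical and chosen only to keep the example short.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def search(node, query_vector):
    """Walk from the root to a terminal node, always taking the closer centroid."""
    while node["children"] is not None:
        child_a, child_b = node["children"]
        d_a = l1(child_a["centroid"], query_vector)
        d_b = l1(child_b["centroid"], query_vector)
        node = child_b if d_a > d_b else child_a     # if d_A > d_B, follow B
    return node["windows"]


# Hypothetical two-leaf index over four database windows.
leaf_a = {"centroid": [0.0, 0.5], "children": None, "windows": ["w0", "w1"]}
leaf_b = {"centroid": [10.0, 10.5], "children": None, "windows": ["w2", "w3"]}
root = {"centroid": [5.0, 5.5], "children": (leaf_a, leaf_b), "windows": None}

near_b = search(root, [9, 10])
near_a = search(root, [1, 0])
print(near_b, near_a)
```

Only two distance computations are needed per level, so the work per query window grows with the depth of the tree rather than with the number of database windows.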

SST: Time Complexity
- Construction of the index: O(n log n)
- Search: O(m log n)
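A rough cost accounting suggests where the search bound comes from. It assumes a balanced tree, database window starts that are ∆ apart (so about n/∆ windows), non-overlapping query windows, and distance evaluations that touch all $4^k$ vector components; these are simplifying assumptions, not figures from the paper.

```latex
T_{\text{search}} \approx
\underbrace{\frac{m}{W}}_{\text{query windows}} \cdot
\underbrace{\log_2 \frac{n}{\Delta}}_{\text{tree depth}} \cdot
\underbrace{2 \cdot 4^{k}}_{\text{two distances per level}}
= O(m \log n) \quad \text{for fixed } W,\ \Delta,\ k.
```

With half-overlapping query windows the first factor becomes about 2m/W, which changes the constant but not the bound.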

Computational Results
Computation time per query, BLAST versus SST, for a database of 120,000 sequences:
- Non-overlapping query windows: for search alone, SST is 27 times faster than BLAST; when building the tree index is included along with the search, SST is 15 times faster.
- Overlapping query windows: for search alone, SST is 13.2 times faster than BLAST; when building the tree index is included along with the search, SST is 9.3 times faster.
- A higher speed-up is expected for larger databases.

Discussion
SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as shotgun sequences or matching ESTs to genomic sequence. Accuracy improves greatly when query windows overlap, but overlapping substantially slows down the algorithm.

Homework 7
Based on the current SST algorithm, describe strategies to further improve the speed and space complexity of the algorithm. SST is designed for fast searches of similar sequences; discuss any drawbacks it may have.
