SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size


SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size
Eldar Giladi, Michael G. Walker, James Z. Wang, Wayne Volkmuth
Bioinformatics, Vol. 18, no. 6, 2002
Norman Casagrande, 2003, IFT 6291

Outline
- Motivation
- Previous Related Research
- The SST Algorithm
- Computational Results
- Discussion

Motivation
- Searches for near-exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics.
- The time and cost of performing these searches are prohibitive with current algorithms.
- Faster algorithms are desired.

Previous related research: Needleman-Wunsch and Smith-Waterman
- These algorithms perform global (Needleman-Wunsch) and local (Smith-Waterman) sequence alignment using dynamic programming.
- Time complexity: O(mn), where m = length of the query sequence and n = sum of the lengths of all sequences in the database.
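
The O(mn) cost comes from filling one dynamic-programming cell per pair of positions. A minimal sketch of Smith-Waterman scoring follows; the function name and the scoring values (match = 2, mismatch = -1, linear gap = -1) are illustrative assumptions, not values from the paper.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between query and target (linear gap penalty).

    Scoring values are illustrative assumptions.
    """
    m, n = len(query), len(target)
    # Only the previous row is needed to compute the score, so memory is O(n).
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, m + 1):        # m rows ...
        curr = [0] * (n + 1)
        for j in range(1, n + 1):    # ... times n columns: O(mn) cell updates
            step = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(0,                   # start a new local alignment
                          prev[j - 1] + step,  # align the two characters
                          prev[j] + gap,       # gap in target
                          curr[j - 1] + gap)   # gap in query
            best = max(best, curr[j])
        prev = curr
    return best
```

With n equal to the total length of the database, the inner loop runs over every database position for every query position, which is what makes exhaustive alignment expensive.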

Previous related research: FASTA
- This algorithm identifies regions of local sequence similarity by first identifying candidate similar sequences based on shared k-tuples, and then performing local alignment with the Smith-Waterman algorithm.
- Time complexity: O(mn).

Previous related research: BLAST
- This algorithm identifies regions of local sequence similarity by first identifying candidate similar sequences that have k-tuples in common with the query sequence, and then extending the regions of similarity.
- Time complexity: O(n).
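
The shared k-tuple idea used by FASTA and BLAST to pick candidates can be illustrated with a simple lookup table. This is a simplified sketch of the seeding step only, not the actual implementation of either tool; `build_kmer_index` and `seed_hits` are hypothetical names.

```python
from collections import defaultdict

def build_kmer_index(sequences: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map every k-tuple to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def seed_hits(query: str, index: dict[str, list[tuple[str, int]]], k: int) -> list[tuple[int, str, int]]:
    """Return (query position, sequence name, database position) for shared k-tuples."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, dpos))
    return hits

# Toy database (illustrative data only).
database = {"s1": "AACCGGTT", "s2": "TTTTTTTT"}
index = build_kmer_index(database, k=4)
hits = seed_hits("CCGG", index, k=4)
```

Each hit would then be extended (BLAST) or passed to Smith-Waterman (FASTA); that second stage is omitted here.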

The SST Algorithm
- Database partitioning with sliding windows
- Mapping windows into vector space
- Tree-structured index for database windows
- The search procedure

SST: Database Partitioning
First step:
- The database is partitioned into overlapping windows.
- Windows have a fixed length W.
- The overlap parameter ∆ typically satisfies 5 ≤ ∆ ≤ W/2.

SST: Database Partitioning
Example with W = 6 and ∆ = 2 on the sequence AACCGGTTACGTACG...: each window starts ∆ positions after the previous one.
AACCGG
CCGGTT
GGTTAC
….
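
The partitioning step can be sketched as follows, reading ∆ as the offset between consecutive window starts, as in the example above. The function name `sliding_windows` is a hypothetical helper, and dropping a trailing partial window is an assumption.

```python
def sliding_windows(sequence: str, w: int, delta: int) -> list[tuple[int, str]]:
    """Return (start, window) pairs of length w, one every delta positions.

    A trailing fragment shorter than w is dropped (an assumption of this sketch).
    """
    if w <= 0 or delta <= 0:
        raise ValueError("w and delta must be positive")
    return [(start, sequence[start:start + w])
            for start in range(0, len(sequence) - w + 1, delta)]

windows = sliding_windows("AACCGGTTACGTACG", w=6, delta=2)
```

Keeping the start offset next to each window lets a later search report where in the database a matching window came from.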

SST: Query Partitioning
The query sequence is partitioned into either:
- non-overlapping windows, or
- windows that overlap by half of their length.
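
Both query partitioning modes differ only in the step between window starts. A minimal sketch, assuming trailing partial windows are discarded; `query_windows` is a hypothetical name.

```python
def query_windows(query: str, w: int, overlap: bool = False) -> list[str]:
    """Split a query into windows of length w.

    overlap=False: consecutive windows do not overlap (step = w).
    overlap=True:  consecutive windows overlap by half their length (step = w // 2).
    """
    step = w // 2 if overlap else w
    if step <= 0:
        raise ValueError("w must be at least 1 (at least 2 when overlap=True)")
    return [query[s:s + w] for s in range(0, len(query) - w + 1, step)]
```

For a 12-character query and w = 6, the non-overlapping mode yields 2 windows and the half-overlapping mode yields 3, so overlapping roughly doubles the number of tree searches for long queries.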

SST: Mapping Windows into Vector
- For each window, create a vector that counts the number of occurrences of each k-tuple.
- Tuple size k: 2 to 10; typically 4 or 5 (found empirically).

SST: Mapping Windows into Vector
Assume window: AACCGG
k = 2 → 16 possible 2-tuples:
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
Resulting vector: one occurrence count per 2-tuple, in the order above.
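
The mapping for the window AACCGG can be sketched as below, assuming overlapping k-tuples are counted and tuples are ordered lexicographically over A, C, G, T (the order AA, AC, AG, ... TT). `kmer_vector` is a hypothetical name.

```python
from itertools import product

def kmer_vector(window: str, k: int, alphabet: str = "ACGT") -> list[int]:
    """Count occurrences of every possible k-tuple in the window.

    The vector has len(alphabet) ** k entries, ordered lexicographically.
    """
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for pos in range(len(window) - k + 1):
        kmer = window[pos:pos + k]
        if kmer in index:  # skip tuples containing characters outside the alphabet
            counts[index[kmer]] += 1
    return counts

vector = kmer_vector("AACCGG", k=2)
# 2-tuples present: AA, AC, CC, CG, GG (one occurrence each)
```

Under these assumptions the vector has 16 entries, with a count of 1 at the positions of AA, AC, CC, CG and GG and 0 elsewhere.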

SST: Creation of Tree-structured Index
- The distance between vectors is used as a heuristic for the distance between sequences: $d(A, B) = \sum_i |A_i - B_i|$.
- Example: A = (0, 1), B = (1, 0) gives d = |0 - 1| + |1 - 0| = 2.
- Method: TSVQ, tree-structured vector quantization (Gersho and Gray, 1992).
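
The distance in the example is a sum of absolute coordinate differences. A minimal sketch (`l1_distance` is a hypothetical name):

```python
def l1_distance(a: list[float], b: list[float]) -> float:
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

d = l1_distance([0, 1], [1, 0])  # |0 - 1| + |1 - 0|
```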

SST: Creation of Tree-structured Index
Select two centroids $X_A$ and $X_B$ and the corresponding partition of the data into disjoint sets A and B using the following iterative procedure:
- Choose two initial values for $X_A$ and $X_B$.
- For each vector y in the database, compute the distance from y to each centroid: $d_A(y) = \sum_i |X_{A,i} - y_i|$, $d_B(y) = \sum_i |X_{B,i} - y_i|$.
- Assign y to set A if $d_A(y) < d_B(y)$, and to set B otherwise.

SST: Creation of Tree-structured Index
- Compute the new centroids: $X_A = \frac{1}{|A|} \sum_{y \in A} y$, $X_B = \frac{1}{|B|} \sum_{y \in B} y$, where |A| = size of set A and |B| = size of set B.

SST: Creation of Tree-structured Index
- Compute values for the terminating criteria: $D_A = \sum_{y \in A} d_A(y)$, $D_B = \sum_{y \in B} d_B(y)$.
- Repeat until the change in $D_A$ and $D_B$ is less than a small threshold, or no vectors change partition.
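
The iterative two-centroid procedure on the preceding slides can be sketched as follows. The function names, the `tol` and `max_iter` parameters, and the handling of an empty set are assumptions of this sketch; the slides do not specify how the initial centroids are chosen, so they are passed in by the caller.

```python
def l1(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def split_in_two(vectors, x_a, x_b, tol=1e-9, max_iter=100):
    """Partition vectors into sets A and B around two centroids.

    Returns (set_a, set_b, x_a, x_b). Assumes max_iter >= 1.
    """
    prev_assign = None
    prev_d = (float("inf"), float("inf"))
    assign = []
    for _ in range(max_iter):
        # y goes to set A if d_A(y) < d_B(y), and to set B otherwise.
        assign = [l1(x_a, y) < l1(x_b, y) for y in vectors]
        if assign == prev_assign:
            break  # no vector changed partition
        set_a = [y for y, in_a in zip(vectors, assign) if in_a]
        set_b = [y for y, in_a in zip(vectors, assign) if not in_a]
        if not set_a or not set_b:
            break  # degenerate split: keep the current centroids
        x_a, x_b = mean(set_a), mean(set_b)
        # Terminating criteria D_A and D_B.
        d = (sum(l1(x_a, y) for y in set_a), sum(l1(x_b, y) for y in set_b))
        converged = all(abs(p - q) < tol for p, q in zip(prev_d, d))
        prev_assign, prev_d = assign, d
        if converged:
            break
    set_a = [y for y, in_a in zip(vectors, assign) if in_a]
    set_b = [y for y, in_a in zip(vectors, assign) if not in_a]
    return set_a, set_b, x_a, x_b

data = [[0, 0], [0, 1], [10, 10], [10, 11]]
set_a, set_b, x_a, x_b = split_in_two(data, x_a=[0, 0], x_b=[10, 10])
```

On this toy data the two obvious groups are separated in one assignment step, and the second pass stops because no vector changes partition.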

SST: Creation of Tree-structured Index
- Recursively partition the sets A and B generated above using the same algorithm.
- The recursion terminates when the number of vectors in a set is smaller than a specified tolerance, or when the algorithm fails to split a cluster into two substantial new clusters.
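
The recursion can be sketched as below. The `Node` class, the `leaf_size` and `min_fraction` parameters (standing in for the "specified tolerance" and the "substantial" test), the fixed iteration count, and the choice of initial centroids (the first vector and the vector farthest from it) are all assumptions of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    centroid: list[float]
    vectors: list[list[int]] = field(default_factory=list)  # filled only at leaves
    children: list["Node"] = field(default_factory=list)    # empty at leaves

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def two_means(vectors, iters=20):
    """Simplified two-centroid split with a fixed number of iterations."""
    x_a = vectors[0]
    x_b = max(vectors, key=lambda y: l1(x_a, y))  # assumed initialization
    set_a, set_b = [], list(vectors)
    for _ in range(iters):
        set_a = [y for y in vectors if l1(x_a, y) < l1(x_b, y)]
        set_b = [y for y in vectors if l1(x_a, y) >= l1(x_b, y)]
        if not set_a or not set_b:
            break
        x_a, x_b = mean(set_a), mean(set_b)
    return set_a, set_b

def build_tree(vectors, leaf_size=3, min_fraction=0.1):
    """Recursively split vectors; stop on small sets or unbalanced splits."""
    node = Node(centroid=mean(vectors))
    if len(vectors) < leaf_size:
        node.vectors = vectors  # too few vectors to split further
        return node
    set_a, set_b = two_means(vectors)
    if min(len(set_a), len(set_b)) < min_fraction * len(vectors):
        node.vectors = vectors  # split did not produce two substantial clusters
        return node
    node.children = [build_tree(set_a, leaf_size, min_fraction),
                     build_tree(set_b, leaf_size, min_fraction)]
    return node

tree = build_tree([[0, 0], [0, 1], [10, 10], [10, 11]])
```

Each internal node keeps only its centroid, while leaves keep the vectors assigned to them, which is all the search procedure needs.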

SST: Creation of Tree-structured Index
TSVQ:
- Each leaf contains the set of vectors that are nearest neighbors to the centroid of that node.
- When the tree is balanced, its depth is proportional to the logarithm of the number of windows, and the number of windows is proportional to the size of the database.
- Average complexity of tree construction: O(n log n).

SST: The Search Procedure
- Begin at the root of the tree. Nodes are represented by their respective centroids.
- Select the branch whose centroid is closer to the query vector.
- Proceed recursively until reaching a terminal node.
- The vectors in the terminal node represent the database windows that are the nearest neighbors to the query window.

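SST: The Search Procedure
- Query window: AGCCTG (length equal to the database window size), mapped to its k-tuple count vector.
- At a node with branches A and B, compute the distances $d_A$ and $d_B$ from the query vector to the two centroids; if $d_A > d_B$, follow branch B.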
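
The descent can be sketched as a loop that follows the closer centroid until it reaches a leaf. The `Node` class, the `windows` field holding window identifiers, and the hand-built toy tree are hypothetical and only serve this illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    centroid: list[float]
    windows: list[str] = field(default_factory=list)      # window ids, leaves only
    children: list["Node"] = field(default_factory=list)  # [A, B] at internal nodes

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def search(root: Node, query_vector: list[float]) -> list[str]:
    """Follow the branch with the closer centroid down to a terminal node."""
    node = root
    while node.children:
        # min() keeps the first child on ties, so branch B is taken only if d_A > d_B.
        node = min(node.children, key=lambda child: l1(child.centroid, query_vector))
    return node.windows

# Toy two-leaf tree (illustrative data only).
root = Node([5.0, 5.5], children=[
    Node([0.0, 0.5], windows=["w0", "w1"]),
    Node([10.0, 10.5], windows=["w2", "w3"]),
])
candidates = search(root, [9, 10])
```

Only two distances are computed per level, so the work per query window grows with the depth of the tree rather than with the number of database windows.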

SST: Time Complexity
- Construction of the index: O(n log n)
- Search: O(m log n)
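
One way to see where the O(m log n) search bound could come from, assuming a balanced tree, a 4-letter alphabet, and W, ∆ and k treated as constants. The symbols N and Q are introduced here for illustration only.

```latex
\begin{align*}
N &\approx \frac{n}{\Delta} && \text{(number of database windows)} \\
Q &\approx \frac{m}{W} && \text{(number of non-overlapping query windows)} \\
T_{\text{search}} &\approx Q \cdot \underbrace{O(\log N)}_{\text{tree depth}}
    \cdot \underbrace{O(4^k)}_{\text{two distances per node}}
  = O\!\left(\frac{m}{W}\, 4^k \log \frac{n}{\Delta}\right)
  = O(m \log n)
\end{align*}
```

With half-overlapping query windows Q roughly doubles, which changes the constant but not the asymptotic bound.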

Computational Results
Comparison of the computation time per query between BLAST and SST, for a database of 120,000 sequences:
- When query windows do not overlap: for search alone, SST is 27 times faster than BLAST; for building the tree index and searching it, SST is 15 times faster.
- When query windows overlap: for search alone, SST is 13.2 times faster than BLAST; for building the tree index and searching it, SST is 9.3 times faster.
- A higher speed-up is expected for larger databases.

Discussion
- SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as shotgun sequences or matching ESTs to genomic sequence.
- Accuracy is greatly improved when query windows overlap, but overlapping substantially slows down the algorithm.

Homework 7:
- Based on the current SST algorithm, describe strategies to further improve the speed and space complexity of the algorithm.
- SST is designed for fast searches of similar sequences. Discuss any drawbacks it may have.