SISAP’08 – Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding
Ahmet Sacan and I. Hakki Toroslu
Computer Engineering Department, Middle East Technical University, Ankara, Turkey

Outline
Background
–Sequence Alignment
–Blast
Embedding Subsequences
–Fastmap, LMDS
–Analysis of parameters to achieve stable and accurate mapping
Indexing Subsequences

Sequence Similarity Search
Sequence similarity search is at the heart of bioinformatics research
–Similarity information allows structural, functional, and evolutionary inferences

Sequence Alignment
Goal: maximize the “alignment score”
Score of aligning two residues:
–Substitution matrix
Optimal solution: Dynamic Programming
–Global: Needleman-Wunsch (1970)
–Local: Smith-Waterman (1981)
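
The dynamic-programming recurrence behind local alignment can be sketched as follows. This is a minimal illustration only: it assumes a flat match/mismatch score and a linear gap penalty in place of a real substitution matrix, and the function name and default scores are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Assumption: flat match/mismatch scores stand in for a substitution
    matrix, and gaps are penalized linearly.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                # start a new local alignment
                prev[j - 1] + s,  # align a[i-1] with b[j-1]
                prev[j] + gap,    # gap in b
                curr[j - 1] + gap # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The table has len(a) x len(b) cells, which is why exhaustive alignment against a whole database is costly and seed-based heuristics are used instead.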

Blast (Basic Local Alignment Search Tool)
Popular tool for similarity search in sequence databases
1) Generate “k-tuples” (“k-mers”, “words”) from the query
   CDEFG → CDE, DEF, EFG
   CDE → ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database
3) For each candidate sequence, extend the k-tuple match in both directions
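
Steps 1 and 2 above can be sketched with a plain dictionary index. This is a simplified illustration, not Blast's actual implementation: it covers only exact k-tuple matches (no neighboring words, no extension phase), and all function names are hypothetical.

```python
from collections import defaultdict

def ktuples(seq, k):
    """All overlapping k-tuples of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(db, k):
    """Map each k-tuple to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos, word in enumerate(ktuples(seq, k)):
            index[word].append((name, pos))
    return index

def seed_hits(query, index, k):
    """Exact k-tuple matches as (query offset, sequence name, db offset)."""
    hits = []
    for qpos, word in enumerate(ktuples(query, k)):
        for name, dpos in index.get(word, []):
            hits.append((qpos, name, dpos))
    return hits
```

For example, `ktuples("CDEFG", 3)` yields the three words shown on the slide, and each seed hit would then be handed to the extension step.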

Time-accuracy trade-off
Challenge:
–Allow flexible matching for larger words at reasonable time
Word size k (small to large):
–Small k: too many k-tuple hits to process; slows down the extension phase
–Large k: few or no k-tuple hits; fast execution; exact k-tuple matching is not sensitive; too many false negatives
Proteins (20^3 tuples), DNA (4^11 tuples)

Raising the bar for k
1. Map k-tuples to a vector space
   –Mapping cannot be perfect, thus “approximate results”
2. Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples

Mapping k-tuples
Requirements:
–Need to support out-of-sample extension
–Speed
Candidate methods:
–Fastmap (Faloutsos, 1995)
–Landmark MDS (de Silva, 2003)

Fastmap
1. Select two pivots
   –Distant pivots heuristic
2. Obtain projection using the cosine law
3. Project objects to a new hyperplane
4. Repeat

Fastmap
Fast! O(Nd)
–N: number of data points
–d: target dimensionality
For a query, only the distances to the set of pivots need to be calculated
Unstable (esp. if the original space is non-Euclidean)
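
The Fastmap steps (pivot selection, cosine-law projection, residual distances, repeat) can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the pivot heuristic is reduced to two "jump to the farthest object" steps, and negative residuals (which can occur for non-Euclidean distances) are simply clamped to zero.

```python
import math

def _res2(d_orig, xi, xj, col):
    """Squared residual distance after removing the first `col` coordinates."""
    s = d_orig ** 2
    for c in range(col):
        s -= (xi[c] - xj[c]) ** 2
    return max(s, 0.0)  # clamp: non-Euclidean inputs can go negative

def fastmap_fit(objects, dist, d):
    """Embed objects into d dimensions; returns (coords, pivots)."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]
    pivots = []

    def r2(i, j, col):
        return _res2(dist(objects[i], objects[j]), coords[i], coords[j], col)

    for col in range(d):
        # Distant-pivots heuristic: start at object 0, jump to the farthest twice.
        a = max(range(n), key=lambda j: r2(0, j, col))
        b = max(range(n), key=lambda j: r2(a, j, col))
        dab2 = r2(a, b, col)
        if dab2 == 0.0:
            break  # no residual distance left; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: position of object i along the line through a and b.
            coords[i][col] = (r2(a, i, col) + dab2 - r2(b, i, col)) / (2 * dab)
        pivots.append((a, b))
    return coords, pivots

def fastmap_query(q, objects, dist, coords, pivots):
    """Map a new object using only its distances to the stored pivots."""
    x = [0.0] * len(coords[0])
    for col, (a, b) in enumerate(pivots):
        dab2 = _res2(dist(objects[a], objects[b]), coords[a], coords[b], col)
        dqa2 = _res2(dist(q, objects[a]), x, coords[a], col)
        dqb2 = _res2(dist(q, objects[b]), x, coords[b], col)
        x[col] = (dqa2 + dab2 - dqb2) / (2 * math.sqrt(dab2))
    return x
```

Mapping a query costs two distance computations per target dimension, which is what makes out-of-sample extension cheap. The clamping in `_res2` is one place where instability can enter when the original space is not Euclidean.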

Landmark MDS
1. Select n landmarks (pivots)
2. Embed landmarks using classical MDS
3. For the remaining objects, apply distance-based triangulation based on distances to landmarks
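
Steps 2 and 3 above can be sketched with NumPy as follows. This is a minimal sketch, assuming NumPy is available and following the usual textbook formulation of classical MDS and distance-based triangulation; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def lmds_fit(D_land, d):
    """Classical MDS on an n x n landmark distance matrix.

    Returns landmark coordinates (n x d), the triangulation matrix (n x d),
    and the mean squared distance to each landmark.
    """
    n = D_land.shape[0]
    sq = D_land ** 2
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * H @ sq @ H                     # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]          # keep the d largest
    vals, vecs = vals[idx], vecs[:, idx]
    if np.any(vals <= 0):
        raise ValueError("fewer than d positive eigenvalues; reduce d")
    L = vecs * np.sqrt(vals)                  # landmark coordinates
    pinv = vecs / np.sqrt(vals)               # used to triangulate new objects
    return L, pinv, sq.mean(axis=0)

def lmds_embed(dists_to_landmarks, pinv, mean_sq):
    """Place a new object from its distances to the n landmarks."""
    delta = np.asarray(dists_to_landmarks, dtype=float) ** 2
    return -0.5 * (delta - mean_sq) @ pinv
```

Only n distance computations are needed per new object, and the eigendecomposition is done once on the small landmark matrix. The `ValueError` guard marks the case where the landmark distances do not support d Euclidean dimensions.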

Landmark MDS
Provides stable results
Good selection of landmarks is critical.
–LMDS random
–LMDS maxmin: add new landmarks that maximize the minimum distance to already selected landmarks
–LMDS fastmap: use the same landmarks as found by Fastmap
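
The maxmin strategy can be sketched as a greedy loop. This is an illustrative sketch: the function name and the choice of the first landmark (`first`) are assumptions, and it expects n to be no larger than the number of objects.

```python
def maxmin_landmarks(objects, dist, n, first=0):
    """Greedily pick n landmark indices, each as far as possible from those chosen."""
    chosen = [first]
    # min_d[i]: distance from object i to its nearest chosen landmark
    min_d = [dist(o, objects[first]) for o in objects]
    while len(chosen) < n:
        nxt = max(range(len(objects)), key=lambda i: min_d[i])
        chosen.append(nxt)
        for i, o in enumerate(objects):
            min_d[i] = min(min_d[i], dist(o, objects[nxt]))
    return chosen
```

Each round costs one pass over the data, so selecting n landmarks takes on the order of n times N distance computations.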

Evaluation
Synthetic datasets
–Randomly generate k-tuples for a given k and alphabet size σ
Real dataset
–Yeast proteins benchmark (σ=20)
–6,341 proteins, 2.9 million residues
–103 query proteins, residues
Weighted Hamming distance
CB-EUC substitution matrix (Sacan, 2007)
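
A weighted Hamming distance between two k-tuples can be sketched as the sum of per-position substitution costs. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the identity cost shown here stands in for a real matrix such as CB-EUC, whose values are not reproduced in these slides.

```python
def identity_cost(a, b):
    """Identity matrix as a cost: 0 for a match, 1 for any mismatch."""
    return 0.0 if a == b else 1.0

def weighted_hamming(u, v, cost=identity_cost):
    """Sum of per-position substitution costs between equal-length k-tuples."""
    if len(u) != len(v):
        raise ValueError("k-tuples must have the same length")
    return sum(cost(a, b) for a, b in zip(u, v))
```

With `identity_cost` this reduces to the plain Hamming distance; passing a different `cost` function weights each substitution individually.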

Sammon’s metric stress: Breaking point dimensionality
[Figure: x-axis: target dimensionality (d); k=5, synthetic dataset, identity matrix]
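
Sammon's stress compares original and embedded pairwise distances, weighting each squared error by the inverse of the original distance. The sketch below assumes the standard definition and a Euclidean embedded space; the function name is hypothetical, and pairs with zero original distance are skipped to avoid division by zero.

```python
import math

def sammon_stress(objects, dist, coords):
    """Sammon's stress of an embedding; 0 means all distances are preserved."""
    total = 0.0  # sum of original distances (normalizer)
    err = 0.0    # sum of squared errors, each scaled by 1 / original distance
    n = len(objects)
    for i in range(n):
        for j in range(i + 1, n):
            d_orig = dist(objects[i], objects[j])
            if d_orig == 0:
                continue  # identical objects carry no distortion information
            d_emb = math.dist(coords[i], coords[j])
            total += d_orig
            err += (d_orig - d_emb) ** 2 / d_orig
    return err / total if total else 0.0
```

Evaluating this for increasing d gives a stress curve of the kind plotted against target dimensionality.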

Subsequence length (k) and alphabet size (σ)

Number of landmarks
[Figure: k=5, d=7, synthetic dataset, identity matrix]

Approximate k-tuple search performance
Find all k-tuples within a specified radius from a query k-tuple
[Figure: k=6, d=8, real dataset, CB-EUC matrix]
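
One way to measure how approximate the range search is: run the query once in the original space and once in the embedded space, then compare the two result sets. The sketch below is illustrative only. It uses a linear scan where a spatial access method would be used in practice, all function names are hypothetical, and the convention of returning 1.0 for empty sets is an assumption.

```python
import math

def exact_range(query, tuples, dist, radius):
    """Indices of k-tuples within `radius` of the query in the original space."""
    return {i for i, t in enumerate(tuples) if dist(query, t) <= radius}

def embedded_range(query_vec, vectors, radius):
    """Indices of embedded k-tuples within `radius` (Euclidean) of the query vector."""
    return {i for i, v in enumerate(vectors) if math.dist(query_vec, v) <= radius}

def precision_recall(approx, exact):
    """Precision and recall of the approximate result set against the exact one."""
    tp = len(approx & exact)
    precision = tp / len(approx) if approx else 1.0
    recall = tp / len(exact) if exact else 1.0
    return precision, recall
```

Missed hits (recall below 1) correspond to false negatives introduced by the imperfect mapping, while extra hits (precision below 1) can be removed by re-checking candidates with the original distance.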

Homology search
[Figure: k=6, d=8, real dataset, CB-EUC matrix]

Search time
[Figure panels: search radius=7; database size=100,000]

Conclusion
Applied an embedding-based approach to approximate sequence similarity search for the first time
Significant time improvements with negligible degradation in accuracy
Achieved more stable embedding with combined pivot selection strategy
Defined intrinsic Euclidean dimensionality of the dataset