SISAP’08, 2008-04-11: Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding. Ahmet Sacan and I. Hakki Toroslu.


SISAP’08: Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding
Ahmet Sacan and I. Hakki Toroslu
Computer Engineering Department, Middle East Technical University, Ankara, Turkey

Outline
- Background
  - Sequence alignment
  - BLAST
- Embedding subsequences
  - Fastmap, LMDS
  - Analysis of parameters to achieve stable and accurate mapping
- Indexing subsequences

Sequence Similarity Search
- Sequence similarity search is at the heart of bioinformatics research.
- Similarity information allows structural, functional, and evolutionary inferences.

Sequence Alignment
- Goal: maximize the “alignment score”.
- The score of aligning two residues comes from a substitution matrix.
- Optimal solution: dynamic programming
  - Global: Needleman-Wunsch (1970)
  - Local: Smith-Waterman (1981)
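The local alignment recurrence named above can be illustrated with a minimal sketch. It assumes a linear gap penalty and a toy match/mismatch scoring function in place of a real substitution matrix; the names `smith_waterman`, `identity_score`, and the `gap` value are illustrative choices, not taken from the slides.

```python
def identity_score(a, b):
    """Toy substitution score: +1 for a match, -1 for a mismatch (assumption)."""
    return 1 if a == b else -1

def smith_waterman(s, t, score=identity_score, gap=-2):
    """Return the best local alignment score between s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                                            # start a new alignment
                H[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align two residues
                H[i - 1][j] + gap,                            # gap in t
                H[i][j - 1] + gap,                            # gap in s
            )
            best = max(best, H[i][j])
    return best

print(smith_waterman("CDEFG", "XXDEFYY"))
```

The table has (|s|+1) x (|t|+1) cells, which is why exhaustive dynamic programming against a whole database is costly and heuristics such as BLAST are used instead.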

BLAST (Basic Local Alignment Search Tool)
A popular tool for similarity search in sequence databases:
1) Generate “k-tuples” (“k-mers”, “words”) from the query.
   CDEFG -> CDE, DEF, EFG
   CDE -> ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database.
3) For each candidate sequence, extend the k-tuple match in both directions.
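Step 1 can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: the neighborhood is enumerated by brute force over a small hypothetical alphabet, and `match_score` and `threshold` stand in for a real substitution matrix and word-score cutoff.

```python
from itertools import product

def kmers(seq, k):
    """All overlapping k-tuples of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def match_score(a, b):
    """Toy score (assumption): 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0

def neighborhood(word, alphabet, score=match_score, threshold=2):
    """Words of the same length whose summed score against `word` reaches the threshold."""
    return [
        "".join(w)
        for w in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, w)) >= threshold
    ]

print(kmers("CDEFG", 3))
print(neighborhood("CDE", "ACDE"))
```

With this toy score and threshold 2, the neighborhood of CDE contains the word itself plus every word that differs from it in exactly one position.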

Time-accuracy trade-off
- Small k: too many k-tuple hits to process, which slows down the extension phase.
- Large k: few or no k-tuple hits, so execution is fast, but exact k-tuple matching is not sensitive and produces too many false negatives.
- Reference points on the k axis: proteins at k = 3 (20^3 = 8,000 tuples), DNA at k = 11 (4^11 = 4,194,304 tuples).
- Challenge: allow flexible matching for larger words in reasonable time.

Raising the bar for k
1. Map k-tuples to a vector space.
   The mapping cannot be perfect, hence “approximate results”.
2. Use spatial access methods (e.g. R-tree, X-tree) to index and retrieve k-tuples.

Mapping k-tuples
- Requirements:
  - Support for out-of-sample extension
  - Speed
- Candidate methods:
  - Fastmap (Faloutsos, 1995)
  - Landmark MDS (de Silva, 2003)

Fastmap
1. Select two pivots (distant pivots heuristic).
2. Obtain the projection using the cosine law.
3. Project objects onto a new hyperplane.
4. Repeat.

Fastmap
- Fast: O(Nd)
  - N: number of data points
  - d: target dimensionality
- For a query, only the distances to the set of pivots need to be calculated.
- Unstable (especially if the original space is non-Euclidean).
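The Fastmap steps can be sketched in plain Python. This is a minimal illustration under stated assumptions: plain Hamming distance stands in for the weighted distance used in the evaluation, the pivot search starts from object 0, and negative residual distances (possible for non-Euclidean input) are clamped to zero. The function names are hypothetical.

```python
import math

def hamming(u, v):
    """Plain Hamming distance between two equal-length k-tuples."""
    return sum(a != b for a, b in zip(u, v))

def fastmap(objects, dist, d):
    """Embed objects into at most d dimensions; returns (coords, pivots)."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]
    pivots = []  # one (index_a, index_b, pivot_distance) triple per dimension

    def residual_sq(i, j, h):
        # Squared distance left after removing the first h coordinates.
        r = dist(objects[i], objects[j]) ** 2
        for t in range(h):
            r -= (coords[i][t] - coords[j][t]) ** 2
        return max(r, 0.0)  # clamp: can go negative for non-Euclidean input

    for h in range(d):
        # Distant pivots heuristic: farthest from object 0, then farthest from that one.
        b = max(range(n), key=lambda j: residual_sq(0, j, h))
        a = max(range(n), key=lambda j: residual_sq(b, j, h))
        dab_sq = residual_sq(a, b, h)
        if dab_sq == 0.0:
            break  # no distance left to project
        dab = math.sqrt(dab_sq)
        pivots.append((a, b, dab))
        for i in range(n):
            # Cosine-law projection onto the line through the two pivots.
            coords[i][h] = (residual_sq(a, i, h) + dab_sq - residual_sq(b, i, h)) / (2 * dab)
    return coords, pivots

def project_query(q, objects, dist, coords, pivots):
    """Out-of-sample mapping: only distances from q to the pivots are computed."""
    x = []
    for h, (a, b, dab) in enumerate(pivots):
        def res_sq(p):
            r = dist(q, objects[p]) ** 2
            for t in range(h):
                r -= (x[t] - coords[p][t]) ** 2
            return max(r, 0.0)
        x.append((res_sq(a) + dab * dab - res_sq(b)) / (2 * dab))
    return x

words = ["AAA", "AAC", "ACC", "CCC"]
coords, pivots = fastmap(words, hamming, 1)
print(coords, pivots)
print(project_query("ACA", words, hamming, coords, pivots))
```

Each dimension needs one pass over the N objects, which matches the O(Nd) cost on the slide. The clamping step is one place where the instability for non-Euclidean distances shows up.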

Landmark MDS
1. Select n landmarks (pivots).
2. Embed the landmarks using classical MDS.
3. For the remaining objects, apply distance-based triangulation using their distances to the landmarks.
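The slides do not spell out the formulas. For reference, the standard Landmark MDS formulation can be written as below, where the symbols are notation introduced here: $\Delta_n$ is the matrix of squared distances between the n landmarks, $\lambda_i$ and $v_i$ are the d largest positive eigenvalues of $B$ and their eigenvectors, and $\delta_a$ is the vector of squared distances from object $a$ to the landmarks.

```latex
% Step 2: classical MDS on the landmarks
B = -\tfrac{1}{2}\, H \Delta_n H, \qquad H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}
\\
L = \begin{bmatrix} \sqrt{\lambda_1}\, v_1^{\top} \\ \vdots \\ \sqrt{\lambda_d}\, v_d^{\top} \end{bmatrix}
\quad \text{(column } j \text{ of } L \text{ is the embedding of landmark } j\text{)}
\\
% Step 3: distance-based triangulation of any other object a
L^{\#} = \begin{bmatrix} v_1^{\top} / \sqrt{\lambda_1} \\ \vdots \\ v_d^{\top} / \sqrt{\lambda_d} \end{bmatrix},
\qquad
\delta_{\mu} = \tfrac{1}{n}\, \Delta_n \mathbf{1},
\qquad
x_a = -\tfrac{1}{2}\, L^{\#} \left( \delta_a - \delta_{\mu} \right)
```

Step 3 needs only the n distances from the new object to the landmarks, which is what provides the out-of-sample extension required for query k-tuples.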

Landmark MDS
- Provides stable results.
- Good selection of landmarks is critical:
  - LMDS random
  - LMDS maxmin: add new landmarks that maximize the minimum distance to the already selected landmarks.
  - LMDS fastmap: use the same landmarks as found by Fastmap.
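The maxmin strategy is a greedy procedure and can be sketched as follows. The choice of the first landmark (`first=0`) and the use of plain Hamming distance are assumptions made for the example; the slides do not say how the seed landmark is picked.

```python
def hamming(u, v):
    """Plain Hamming distance between two equal-length k-tuples."""
    return sum(a != b for a, b in zip(u, v))

def maxmin_landmarks(objects, dist, n, first=0):
    """Greedily pick n landmark indices, each maximizing its minimum distance to those already chosen."""
    chosen = [first]
    # min_d[i] = distance from object i to its nearest chosen landmark.
    min_d = [dist(o, objects[first]) for o in objects]
    while len(chosen) < n:
        nxt = max(range(len(objects)), key=lambda i: min_d[i])
        if min_d[nxt] == 0:
            break  # every remaining object coincides with a landmark
        chosen.append(nxt)
        for i, o in enumerate(objects):
            min_d[i] = min(min_d[i], dist(o, objects[nxt]))
    return chosen

words = ["AAAA", "AAAC", "AACC", "ACCC", "CCCC"]
print(maxmin_landmarks(words, hamming, 3))
```

Each added landmark costs one pass over the dataset, so selecting n landmarks from N objects takes on the order of nN distance computations.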

Evaluation
- Synthetic datasets: randomly generated k-tuples for a given k and alphabet size σ.
- Real dataset: yeast proteins benchmark (σ=20)
  - 6,341 proteins, 2.9 million residues
  - 103 query proteins, residues
- Weighted Hamming distance
- CB-EUC substitution matrix (Sacan, 2007)
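A weighted Hamming distance sums a per-position substitution cost over two equal-length k-tuples. The sketch below shows the idea. The CB-EUC values are not given in the slides, so `TOY_COSTS` is an invented placeholder table, not the real matrix.

```python
def weighted_hamming(u, v, cost):
    """Sum of per-position substitution costs between two equal-length k-tuples."""
    if len(u) != len(v):
        raise ValueError("k-tuples must have equal length")
    return sum(cost(a, b) for a, b in zip(u, v))

def identity_cost(a, b):
    """Identity matrix case: reduces to the plain Hamming distance."""
    return 0.0 if a == b else 1.0

# Hypothetical costs for two residue pairs; every other mismatch costs 1.0.
TOY_COSTS = {frozenset("DE"): 0.5, frozenset("IL"): 0.25}

def toy_cost(a, b):
    if a == b:
        return 0.0
    return TOY_COSTS.get(frozenset((a, b)), 1.0)

print(weighted_hamming("CDE", "ADE", identity_cost))
print(weighted_hamming("DIL", "ELI", toy_cost))
```

With the identity cost, this is the distance used for the synthetic experiments labeled "identity matrix" in the figures.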

Sammon’s metric stress: Breaking point dimensionality
[Figure: Sammon’s metric stress vs. target dimensionality (d); k=5, synthetic dataset, identity matrix]
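The slide does not give the stress formula. The sketch below uses the standard definition of Sammon's stress, E = (1 / Σ d*_ij) Σ (d*_ij − d_ij)² / d*_ij over pairs i < j, where d* is the original distance and d the Euclidean distance in the embedding. The authors' exact variant is an assumption here, as are the function names.

```python
import math
from itertools import combinations

def hamming(u, v):
    """Plain Hamming distance between two equal-length k-tuples."""
    return sum(a != b for a, b in zip(u, v))

def sammon_stress(objects, coords, dist):
    """Sammon's stress of an embedding: 0 means all pairwise distances are preserved."""
    total = 0.0  # sum of original distances
    err = 0.0    # sum of squared errors, each scaled by the original distance
    for i, j in combinations(range(len(objects)), 2):
        d_orig = dist(objects[i], objects[j])
        if d_orig == 0:
            continue  # identical objects carry no weight
        d_emb = math.dist(coords[i], coords[j])
        total += d_orig
        err += (d_orig - d_emb) ** 2 / d_orig
    return err / total if total else 0.0

words = ["AAA", "AAC", "ACC", "CCC"]
print(sammon_stress(words, [[0.0], [1.0], [2.0], [3.0]], hamming))
print(sammon_stress(words, [[0.0], [0.0], [0.0], [0.0]], hamming))
```

Plotting this value against the target dimensionality d is one way to locate the point beyond which adding dimensions no longer lowers the stress.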

Subsequence length (k) and alphabet size (σ)

Number of landmarks
[Figure: k=5, d=7, synthetic dataset, identity matrix]

Approximate k-tuple search performance
- Find all k-tuples within a specified radius from a query k-tuple.
[Figure: k=6, d=8, real dataset, CB-EUC matrix]
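A radius query over embedded k-tuples can be sketched as a filter-and-refine loop. Several parts are assumptions: a linear scan stands in for the spatial index (R-tree, X-tree), the refinement with the original distance is an optional step not confirmed by the slides, and the one-dimensional coordinates are a hand-made toy embedding.

```python
import math

def hamming(u, v):
    """Plain Hamming distance between two equal-length k-tuples."""
    return sum(a != b for a, b in zip(u, v))

def range_query(query_vec, vectors, radius):
    """Indices of embedded points within `radius` of the query (linear scan stand-in for an index)."""
    return [i for i, v in enumerate(vectors) if math.dist(query_vec, v) <= radius]

def approximate_search(query, query_vec, ktuples, vectors, dist, radius):
    """Filter in the embedded space, then keep candidates whose original distance is within radius."""
    candidates = range_query(query_vec, vectors, radius)
    return [ktuples[i] for i in candidates if dist(query, ktuples[i]) <= radius]

ktuples = ["AAA", "AAC", "ACC", "CCC"]
vectors = [[0.0], [1.0], [2.0], [3.0]]  # toy embedding, assumed for the example
print(approximate_search("AAC", [1.0], ktuples, vectors, hamming, 1))
```

Because the embedding does not preserve distances exactly, a true neighbor can fall outside the radius in the embedded space and be missed, which is why the results are approximate.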

Homology search
[Figure: k=6, d=8, real dataset, CB-EUC matrix]

Search time
[Figure panels: search radius=7; database size=100,000]

Conclusion
- Applied an embedding-based approach to approximate sequence similarity search for the first time.
- Significant time improvements with negligible degradation in accuracy.
- Achieved a more stable embedding with a combined pivot selection strategy.
- Defined the intrinsic Euclidean dimensionality of the dataset.