4. Molecular Similarity. 2 Similarity and Searching Historical Progression Similarity Measures Fingerprint Construction “Pathological” Cases MinMax- Counts.


4. Molecular Similarity

2 Similarity and Searching
Historical Progression
Similarity Measures
Fingerprint Construction
“Pathological” Cases
MinMax (Counts)
Pruning Search Space
Aggregate Queries
LSH

3 Historical Progression
Maximum Common Subgraph Isomorphism (MCS)
–maximum common substructure between two molecules
–“NP-complete”
Structural Keys
–dictionary of predetermined, domain-specific substructures keyed to particular positions in a bit vector constructed for each molecule
–similarity computed between bit vectors (fast O(D) scan)
2D Compressed Fingerprints
–ALL substructures stored in a bit vector using a hashing scheme plus lossy compression (modulo operator)
–similarity computed between bit vectors or count vectors
Faster Searches
–database pruning
–locality-sensitive hashing (LSH): towards O(log n) similarity searching

4 Superstructure and Substructure Searches
A is a superstructure of B (ignoring H)
B is a substructure of A
Tversky similarity
[Figure: structures A and B]
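The slide only names the superstructure/substructure relationship and Tversky similarity. As a minimal sketch, assuming path-based fingerprints stored as Python integers (an assumption, not something the slide states), a bit-subset test gives a cheap necessary condition: if B is a substructure of A, every path in B also occurs in A, so every bit set for B is also set for A. The function name is hypothetical, and the screen can return false positives because different paths may hash to the same bit.

```python
def passes_substructure_screen(fp_query: int, fp_candidate: int) -> bool:
    """Return True if every bit set in fp_query is also set in fp_candidate.

    This is only a screen: a True result means the candidate *may* contain
    the query as a substructure; a False result rules it out.
    """
    return (fp_query & fp_candidate) == fp_query


# Toy fingerprints (made-up bit patterns, for illustration only).
fp_b = 0b0110  # smaller structure B
fp_a = 0b1110  # larger structure A

print(passes_substructure_screen(fp_b, fp_a))  # B may be a substructure of A
print(passes_substructure_screen(fp_a, fp_b))  # A cannot be a substructure of B
```

Candidates that pass the screen would still need an exact subgraph check to confirm the match.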

5 The Similarity Problem
How similar?

6 Spectral Similarity
1. Count substructures
2. Compare the count/bit vectors

7 2D Graph Substructures
For chemical compounds
–atom/node labels: A = {C, N, O, H, …}
–bond/edge labels: B = {s, d, t, ar, …}
Trace ALL paths: O(N*d^l)
Cycles and trees
Combinatorial space (CsNsCdO)
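Path tracing can be sketched as a depth-first walk from every atom. The toy molecule, the function name `enumerate_paths`, and the string encoding (atom label, bond label, atom label, ...) are assumptions chosen to reproduce the slide's example path CsNsCdO; the original presentation does not show an implementation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy graph C-N-C=O with hydrogens ignored.
ATOMS: Dict[int, str] = {0: "C", 1: "N", 2: "C", 3: "O"}
BONDS: Dict[Tuple[int, int], str] = {(0, 1): "s", (1, 2): "s", (2, 3): "d"}


def enumerate_paths(atoms: Dict[int, str],
                    bonds: Dict[Tuple[int, int], str],
                    max_bonds: int) -> Set[str]:
    """Collect all simple labeled paths with at most max_bonds bonds."""
    # Build an undirected adjacency list: atom -> [(neighbor, bond label)].
    adjacency: Dict[int, List[Tuple[int, str]]] = {i: [] for i in atoms}
    for (i, j), label in bonds.items():
        adjacency[i].append((j, label))
        adjacency[j].append((i, label))

    found: Set[str] = set()

    def walk(node: int, visited: frozenset, text: str, depth: int) -> None:
        found.add(text)
        if depth == max_bonds:
            return
        for nxt, label in adjacency[node]:
            if nxt not in visited:  # simple paths only, no atom revisited
                walk(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in atoms:
        walk(start, frozenset({start}), atoms[start], 0)
    return found


paths = enumerate_paths(ATOMS, BONDS, max_bonds=3)
print(sorted(paths))
```

Each path is recorded in both directions (for example CsN and NsC); a production scheme would normally canonicalize the direction before hashing. The number of walks grows with the branching factor raised to the path length, which is the source of the O(N*d^l) cost.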

8 Mapping Structures to Bits
Compact data representation
Hash each path to bit vector
Feature space → Bit space
Resolve clashes with OR operator (i.e., 1+1=1)
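A minimal sketch of the hashing step, assuming each path is a string and the fingerprint is a Python integer used as a bit vector. The choice of SHA-1 as the hash and the names `path_to_bit` and `fingerprint` are illustrative assumptions; the slides only say that paths are hashed and folded with the modulo operator, and that clashes are resolved with OR.

```python
import hashlib
from typing import Iterable


def path_to_bit(path: str, n_bits: int) -> int:
    """Map a path string to a bit position in [0, n_bits)."""
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    # The modulo is the lossy compression step: many paths share a position.
    return int.from_bytes(digest[:8], "big") % n_bits


def fingerprint(paths: Iterable[str], n_bits: int = 64) -> int:
    """OR together the bits of all paths; a clash leaves the bit at 1."""
    fp = 0
    for path in paths:
        fp |= 1 << path_to_bit(path, n_bits)
    return fp


fp = fingerprint(["C", "N", "O", "CsN", "CdO", "CsNsCdO"], n_bits=64)
print(f"{fp:064b}")
```

Because of the OR, setting the same bit twice has no further effect, so the number of set bits is at most the number of distinct paths.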

9 Similarity Measures
There are many ways of measuring similarity (or distance) between bit/count vectors:
–Euclidean
–Cosine
–Exponentials
–Tanimoto/Jaccard
–Tversky
–MinMax
–And many more (L1, L2, Lp, Hamming, Manhattan, …)
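For orientation, a few of the listed measures can be written directly on count vectors. This is a sketch using standard textbook definitions; the function names and the convention of returning 0.0 for the cosine of a zero vector are assumptions made here, not taken from the slides.

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x: Sequence[float], y: Sequence[float]) -> int:
    """Number of positions where exactly one vector has the feature on."""
    return sum((a > 0) != (b > 0) for a, b in zip(x, y))


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


x = [1, 0, 2]  # toy count vectors, made up for illustration
y = [1, 1, 0]
print(euclidean(x, y), manhattan(x, y), hamming(x, y), cosine(x, y))
```

Euclidean, Manhattan, and Hamming are distances (smaller means more alike), while cosine is a similarity (larger means more alike).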

10

11

12 Similarity Measures: Tanimoto
Tally features:
–Unique (a, b)
–Both on (c)
–Both off (d)
Similarity Formula
–Tanimoto = c/(a+b+c)
[Figure: overlap of A and B, with regions a, c, b]
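The tally and the formula can be sketched on fingerprints stored as Python integers. The names `tally` and `tanimoto` are illustrative, and returning 1.0 when both fingerprints are empty is a convention assumed here.

```python
from typing import Tuple


def tally(fp_a: int, fp_b: int) -> Tuple[int, int, int]:
    """Return (a, b, c): bits unique to A, unique to B, and on in both."""
    a = bin(fp_a & ~fp_b).count("1")
    b = bin(fp_b & ~fp_a).count("1")
    c = bin(fp_a & fp_b).count("1")
    return a, b, c


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto = c / (a + b + c); bits that are off in both are ignored."""
    a, b, c = tally(fp_a, fp_b)
    total = a + b + c
    return c / total if total else 1.0  # assumed convention for two empty fingerprints


print(tally(0b1110, 0b0111))     # (1, 1, 2)
print(tanimoto(0b1110, 0b0111))  # 0.5
```

Note that d, the count of bits that are off in both fingerprints, does not enter the formula, so Tanimoto does not reward two molecules for both lacking a feature.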

13 The Fingerprint Approximation
Fingerprint bit similarity approximates chemical feature similarity.

14 Similarity Measures: Tversky
Tally features:
–Unique (a, b)
–Both on (c)
–Both off (d)
Similarity Formula
–Tanimoto = c/(a+b+c)
–Tversky(α,β) = c/(αa+βb+c)
[Figure: overlap of A and B, with regions a, c, b]
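A short sketch of the Tversky formula, again assuming integer bit vectors. It illustrates two properties that follow directly from the formula: α = β = 1 recovers Tanimoto, and α = 1, β = 0 scores 1.0 whenever every bit of A is also on in B, which is how Tversky connects to substructure-style searches. The function name and the 0.0 fallback for a zero denominator are assumptions.

```python
def tversky(fp_a: int, fp_b: int, alpha: float, beta: float) -> float:
    """Tversky(alpha, beta) = c / (alpha*a + beta*b + c)."""
    a = bin(fp_a & ~fp_b).count("1")  # bits on only in A
    b = bin(fp_b & ~fp_a).count("1")  # bits on only in B
    c = bin(fp_a & fp_b).count("1")   # bits on in both
    denominator = alpha * a + beta * b + c
    return c / denominator if denominator else 0.0  # assumed convention


fp_a = 0b0110  # toy fingerprints, made up for illustration
fp_b = 0b1110

print(tversky(fp_a, fp_b, 1.0, 1.0))  # same value as Tanimoto: 2/3
print(tversky(fp_a, fp_b, 1.0, 0.0))  # 1.0: all of A's bits are in B
print(tversky(fp_a, fp_b, 0.0, 1.0))  # 2/3: B has one bit that A lacks
```

Unlike Tanimoto, Tversky is asymmetric whenever α and β differ, so the order of the arguments matters.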

15 Pathological Cases
Flower DR. On the Properties of Bit String-Based Measures of Chemical Similarity. J. Chem. Inf. Comput. Sci. 1998, 38,

16 Pathological Cases
Issue of labeling scheme.

17 Counts
MinMax similarity is a generalization of Tanimoto that uses feature counts instead of bits.
MinMax can work better than Tanimoto.
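A sketch of MinMax on count vectors, using the usual definition (sum of element-wise minima divided by sum of element-wise maxima). Storing the counts as dictionaries keyed by path string, the function name, and the 1.0 result for two empty vectors are assumptions made for illustration.

```python
from typing import Dict


def minmax(counts_a: Dict[str, int], counts_b: Dict[str, int]) -> float:
    """MinMax = sum_i min(a_i, b_i) / sum_i max(a_i, b_i)."""
    features = set(counts_a) | set(counts_b)
    numerator = sum(min(counts_a.get(f, 0), counts_b.get(f, 0)) for f in features)
    denominator = sum(max(counts_a.get(f, 0), counts_b.get(f, 0)) for f in features)
    return numerator / denominator if denominator else 1.0  # assumed convention


# Toy path counts (made up for illustration).
x = {"CsN": 2, "CdO": 1}
y = {"CsN": 1, "CdO": 1, "CsC": 3}
print(minmax(x, y))  # (1 + 1 + 0) / (2 + 1 + 3) = 1/3

# With 0/1 counts MinMax reduces to Tanimoto: c / (a + b + c).
bits_a = {"p1": 1, "p2": 1, "p3": 1}
bits_b = {"p2": 1, "p3": 1, "p4": 1}
print(minmax(bits_a, bits_b))  # 2 / 4 = 0.5
```

The second example shows why MinMax is described as a generalization: on binary vectors the minima count the shared features and the maxima count the union.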

18 Pruning Search Space Using Bounds
Linear speedup (search C×D) for fixed threshold, often by one order of magnitude or more.
Sub-linear speedup (search C×D^0.6) for top K.
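The slide does not say which bound is used. As one illustration, the sketch below assumes the simple bit-count bound Tanimoto(A, B) ≤ min(|A|, |B|) / max(|A|, |B|), where |A| is the number of set bits. The database is grouped by bit count, and whole groups whose bound falls below the threshold are skipped without computing any similarity. All names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def popcount(fp: int) -> int:
    return bin(fp).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 1.0


def upper_bound(n_a: int, n_b: int) -> float:
    """Largest Tanimoto possible for fingerprints with n_a and n_b set bits."""
    larger = max(n_a, n_b)
    return min(n_a, n_b) / larger if larger else 1.0


def build_index(database: Iterable[int]) -> Dict[int, List[int]]:
    """Group fingerprints by their number of set bits."""
    index: Dict[int, List[int]] = defaultdict(list)
    for fp in database:
        index[popcount(fp)].append(fp)
    return index


def threshold_search(query: int, index: Dict[int, List[int]],
                     threshold: float) -> Tuple[List[int], int]:
    """Return (hits, number of fingerprints actually compared)."""
    n_query = popcount(query)
    hits: List[int] = []
    compared = 0
    for n_bits, bucket in index.items():
        if upper_bound(n_query, n_bits) < threshold:
            continue  # no fingerprint in this bucket can reach the threshold
        for fp in bucket:
            compared += 1
            if tanimoto(query, fp) >= threshold:
                hits.append(fp)
    return hits, compared


database = [0b1111, 0b1110, 0b0001, 0b11111111, 0b0111]  # toy data
hits, compared = threshold_search(0b1111, build_index(database), 0.7)
print(sorted(hits), compared, len(database))
```

The search stays exact because a skipped bucket provably contains no hits; the saving depends on how many buckets the threshold rules out.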

19

20 Speedup from Pruning
Speedup depends on:
–Threshold
–Query
–Fingerprint length
–Database size

21

22

23 Bias in Query Distribution

24

25

26 Aggregate Queries (“Profiles”)

27 Two Basic Strategies
Similar to bioinformatics
1. Aggregate individual pairwise measures
2. Build a fingerprint profile
–Linear approaches
–Non-linear approaches (consensus, modal, etc.)
Hybrid (profile + aggregation/“scaling”)
Profile-profile

28 Aggregations
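The slide gives no detail on how the pairwise measures are combined. A minimal sketch, assuming a query made of several molecules and a candidate scored by the maximum or the mean of its Tanimoto similarities to each query molecule; both aggregation rules and all names are illustrative assumptions.

```python
from typing import Callable, Dict, List, Sequence


def tanimoto(fp_a: int, fp_b: int) -> float:
    union = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / union if union else 1.0


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Hypothetical aggregation rules; others (min, weighted sums) are possible.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "max": max,
    "mean": mean,
}


def aggregate_score(query_fps: List[int], candidate: int, how: str = "max") -> float:
    """Score one candidate against a multi-molecule query."""
    similarities = [tanimoto(q, candidate) for q in query_fps]
    return AGGREGATORS[how](similarities)


queries = [0b1100, 0b0011]  # toy query set
print(aggregate_score(queries, 0b1100, "max"))   # 1.0
print(aggregate_score(queries, 0b1100, "mean"))  # 0.5
```

The maximum rewards a candidate that closely matches any one query molecule, while the mean favors candidates that resemble the query set as a whole.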

29 Consensus Fingerprints
Create consensus fingerprint
Search database using the consensus
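How the consensus is formed is not spelled out on the slide. The sketch below assumes a voting rule: a bit is on in the consensus if at least `min_votes` of the query fingerprints have it on. With `min_votes` equal to the number of queries this is a bitwise AND, and with `min_votes=1` it is a bitwise OR. The function name and parameter are hypothetical.

```python
from typing import Sequence


def consensus_fingerprint(fps: Sequence[int], n_bits: int, min_votes: int) -> int:
    """Set each bit that is on in at least min_votes of the fingerprints."""
    if min_votes < 1:
        raise ValueError("min_votes must be at least 1")
    result = 0
    for bit in range(n_bits):
        votes = sum((fp >> bit) & 1 for fp in fps)
        if votes >= min_votes:
            result |= 1 << bit
    return result


query_fps = [0b1110, 0b0110, 0b0111]  # toy query set
print(bin(consensus_fingerprint(query_fps, 4, min_votes=3)))  # AND of all three
print(bin(consensus_fingerprint(query_fps, 4, min_votes=2)))  # majority vote
print(bin(consensus_fingerprint(query_fps, 4, min_votes=1)))  # OR of all three
```

The resulting single fingerprint can then be compared against the database with any of the bit-vector measures, so the search costs the same as a one-molecule query.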

30 Locality-Sensitive Hashing
Bin fingerprints based on projections onto randomly directed vectors
log D random vectors → O(log D)
Search for neighbors by returning the bin corresponding to the query’s projection
Has been used for clustering.
May be useful for building diverse data sets.
Not yet developed for searching.
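The binning idea can be sketched with random Gaussian direction vectors, where the bin key records on which side of each vector a fingerprint's projection falls. Using the sign of the projection as the key, taking about log2(D) vectors for a database of D fingerprints, and all function names are assumptions made for this illustration.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Key = Tuple[bool, ...]


def make_projections(n_bits: int, n_vectors: int, seed: int = 0) -> List[List[float]]:
    """Draw n_vectors random directions in n_bits-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_bits)] for _ in range(n_vectors)]


def bin_key(fp: Sequence[int], projections: List[List[float]]) -> Key:
    """One boolean per direction: is the projection non-negative?"""
    return tuple(sum(w * x for w, x in zip(vec, fp)) >= 0.0 for vec in projections)


def build_bins(database: List[Sequence[int]],
               projections: List[List[float]]) -> Dict[Key, List[int]]:
    """Map each bin key to the indices of the fingerprints that fall in it."""
    bins: Dict[Key, List[int]] = defaultdict(list)
    for index, fp in enumerate(database):
        bins[bin_key(fp, projections)].append(index)
    return bins


def query_bin(query: Sequence[int], bins: Dict[Key, List[int]],
              projections: List[List[float]]) -> List[int]:
    """Return the indices stored in the query's bin (possibly empty)."""
    return bins.get(bin_key(query, projections), [])


# Toy database of 8-bit fingerprints as 0/1 lists.
database = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]
n_vectors = max(1, math.ceil(math.log2(len(database))))
projections = make_projections(n_bits=8, n_vectors=n_vectors)
bins = build_bins(database, projections)
print(query_bin(database[0], bins, projections))
```

Computing the key costs one dot product per random vector, which is where the O(log D) figure comes from. The lookup is approximate: near neighbors that land on the other side of a random vector are missed unless several independent tables are used.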

31 Outline
Historical Progression
Similarity Measures
Fingerprint Construction
“Pathological” Cases
MinMax (Counts)
Pruning Search Space
Aggregate Queries
LSH