Similarity Methods C371 Fall 2004.

Limitations of Substructure Searching/3D Pharmacophore Searching
- Need to know what you are looking for
- Compound is either there or not
- Don’t get a feel for the relative ranking of the compounds
- Output size can be a problem

Similarity Searching
- Look for compounds that are most similar to the query compound
- Each compound in the database is ranked
- In other application areas, the technique is known as pattern matching or signature analysis
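
The ranking idea above can be sketched in a few lines. This is only an illustration: the fingerprints are hypothetical sets of feature indices, and `shared_fraction`, `rank_database`, `top_n`, and `min_score` are names invented here, not part of any particular search system.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def shared_fraction(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Features common to both fingerprints divided by features seen in either."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


def rank_database(
    query: Set[int],
    database: Dict[str, Set[int]],
    similarity: Callable[[Set[int], Set[int]], float] = shared_fraction,
    top_n: Optional[int] = None,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score every compound against the query and return them best first."""
    scored = [(name, similarity(query, fp)) for name, fp in database.items()]
    scored = [hit for hit in scored if hit[1] >= min_score]
    scored.sort(key=lambda hit: hit[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


# Hypothetical data: each set lists the indices of the features that are present.
database = {"cmpd_1": {1, 2, 3, 4}, "cmpd_2": {1, 2, 7}, "cmpd_3": {8, 9}}
query = {1, 2, 3}

hits = rank_database(query, database, top_n=2)
print(hits)
```

Unlike a substructure search, every compound receives a score, and the user decides how much of the ranked list to keep through `top_n` or `min_score`.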

Similar Property Principle
- Structurally similar molecules usually have similar properties, e.g., biological activity
- Known also as “neighborhood behavior”
- Examples: morphine, codeine, heroin
- Definition of in silico: using computational techniques as a substitute for or complement to experimental methods

Advantages of Similarity Searching
- One known active compound becomes the search key
- User sets the limits on output
- Possible to re-cycle the top answers to find other possibilities
- Subjective determination of the degree of similarity

Applications of Similarity Searching
- Evaluation of the uniqueness of proposed or newly synthesized compounds
- Finding starting materials or intermediates in synthesis design
- Handling of chemical reactions and mixtures
- Finding the right chemicals for one’s needs, even if not sure what is needed

Subjective Nature of Similarity Searching
- No hard and fast rules
- Numerical descriptors are used to compare molecules
- A similarity coefficient is defined to quantify the degree of similarity
- Similarity and dissimilarity rankings can be different in principle

Similarity and Dissimilarity
“Consider two objects A and B, a is the number of features (characteristics) present in A and absent in B, b is the number of features absent in A and present in B, c is the number of features common to both objects, and d is the number of features absent from both objects. Thus, c and d measure the present and the absent matches, respectively, i.e., similarity; while a and b measure the corresponding mismatches, i.e., dissimilarity.” (Chemoinformatics: A Textbook (2003), p. 304)
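
As a sketch of the four counts in the quotation, the snippet below tallies a, b, c, and d for two made-up bit vectors. The function name `feature_counts` and the example vectors are illustrative assumptions. Note that in this notation a and b count mismatches only, so the Tanimoto coefficient would be written c / (a + b + c); the Tanimoto slide instead uses a and b for the total number of bits set in each molecule.

```python
from typing import Sequence, Tuple


def feature_counts(fp_a: Sequence[int], fp_b: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) as defined in the quotation above."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(1 for x, y in zip(fp_a, fp_b) if x == 1 and y == 0)  # only in A
    b = sum(1 for x, y in zip(fp_a, fp_b) if x == 0 and y == 1)  # only in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == 1 and y == 1)  # in both
    d = sum(1 for x, y in zip(fp_a, fp_b) if x == 0 and y == 0)  # in neither
    return a, b, c, d


# Hypothetical six-bit fingerprints.
A = [1, 1, 0, 1, 0, 0]
B = [1, 0, 0, 1, 1, 0]

a, b, c, d = feature_counts(A, B)
simple_matching = (c + d) / (a + b + c + d)  # counts shared absences as matches
tanimoto = c / (a + b + c)                   # ignores shared absences
print((a, b, c, d), simple_matching, tanimoto)
```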

2D Similarity Measures
- Commonly based on “fingerprints,” binary vectors with 1 indicating the presence of the fragment and 0 the absence
- Could relate structural keys, hashed fingerprints, or continuous data (e.g., topological indexes that take into account size, degree of branching, and overall shape)

Tanimoto Coefficient
- Tanimoto coefficient of similarity for molecules A and B: $S_{AB} = \frac{c}{a + b - c}$
- a = number of bits set to 1 in A, b = number of bits set to 1 in B, c = number of 1 bits common to both
- Range is 0 to 1
- A value of 1 does not mean the molecules are identical, only that their fingerprints are
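
A minimal sketch of the formula, assuming fingerprints are stored as Python integers used as bit masks. The function names and the four-key element fingerprint are hypothetical and deliberately crude; they only show how two different molecules can end up with a coefficient of 1.

```python
def tanimoto_bits(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient c / (a + b - c) for two integer bit masks."""
    a = bin(fp_a).count("1")          # bits set in A
    b = bin(fp_b).count("1")          # bits set in B
    c = bin(fp_a & fp_b).count("1")   # bits set in both
    denominator = a + b - c
    return c / denominator if denominator else 0.0


# a = 4, b = 4, c = 3, so S_AB = 3 / (4 + 4 - 3) = 0.6
print(tanimoto_bits(0b101101, 0b100111))

# Toy fingerprint (an assumption for illustration): one bit per element present.
ELEMENT_KEYS = ["C", "N", "O", "S"]


def element_key_fingerprint(elements: set) -> int:
    fp = 0
    for position, element in enumerate(ELEMENT_KEYS):
        if element in elements:
            fp |= 1 << position
    return fp


ethanol = element_key_fingerprint({"C", "O"})         # CH3CH2OH
dimethyl_ether = element_key_fingerprint({"C", "O"})  # CH3OCH3
print(tanimoto_bits(ethanol, dimethyl_ether))  # identical fingerprints, different molecules
```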

Similarity Coefficients
- Tanimoto coefficient is most widely used for binary fingerprints
- Others: Dice coefficient, cosine similarity, Euclidean distance, Hamming distance, Soergel distance
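
For binary fingerprints, all of the listed coefficients can be written in terms of the same three counts used for the Tanimoto coefficient (a and b are the bits set in each molecule, c the bits in common). The sketch below uses the standard binary forms; the function name `coefficients` is an assumption made for this example.

```python
import math
from typing import Dict


def coefficients(a: int, b: int, c: int) -> Dict[str, float]:
    """Binary-fingerprint forms of common similarity and distance coefficients."""
    mismatches = a + b - 2 * c   # bits set in exactly one of the two molecules
    union = a + b - c            # bits set in at least one molecule
    return {
        "tanimoto": c / union if union else 0.0,
        "dice": 2 * c / (a + b) if a + b else 0.0,
        "cosine": c / math.sqrt(a * b) if a and b else 0.0,
        "hamming": float(mismatches),
        "euclidean": math.sqrt(mismatches),
        "soergel": mismatches / union if union else 0.0,
    }


values = coefficients(a=4, b=4, c=3)
for name, value in values.items():
    print(f"{name:10s} {value:.3f}")
```

The first three are similarities (larger means more alike) and the last three are distances (larger means less alike). For binary data the Soergel distance is the complement of the Tanimoto coefficient.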

Distance Between Pairs of Molecules
- Used to define dissimilarity of molecules
- Regards a common absence of a feature as evidence of similarity

When is a distance coefficient a metric?
- Distance values must be zero or positive
- Distance from an object to itself must be zero
- Distance values must be symmetric
- Distance values must obey the triangle inequality: $D_{AB} \le D_{AC} + D_{BC}$
- Distance between non-identical objects must be greater than zero
- Dissimilarity = distance in the n-dimensional descriptor space
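
The conditions above can be checked by brute force on a small set of fingerprints. The sketch below is an illustration with invented names (`satisfies_metric_conditions`, `soergel`, `dice_distance`). It tests every pair and triple among the non-empty subsets of three bits. Passing such a check on a small sample is not a proof, but a single failure is enough to show that a coefficient is not a metric.

```python
from itertools import combinations, product
from typing import Callable, FrozenSet, List


def soergel(x: FrozenSet[int], y: FrozenSet[int]) -> float:
    """1 - Tanimoto for fingerprints stored as sets of 'on' bits."""
    union = len(x | y)
    return 1 - len(x & y) / union if union else 0.0


def dice_distance(x: FrozenSet[int], y: FrozenSet[int]) -> float:
    """1 - Dice coefficient."""
    total = len(x) + len(y)
    return 1 - 2 * len(x & y) / total if total else 0.0


def satisfies_metric_conditions(
    dist: Callable[[FrozenSet[int], FrozenSet[int]], float],
    objects: List[FrozenSet[int]],
    tol: float = 1e-12,
) -> bool:
    for x, y in product(objects, repeat=2):
        d = dist(x, y)
        if d < -tol:                        # non-negativity
            return False
        if abs(d - dist(y, x)) > tol:       # symmetry
            return False
        if x == y and abs(d) > tol:         # zero distance to itself
            return False
        if x != y and d <= tol:             # distinct objects are separated
            return False
    for x, y, z in product(objects, repeat=3):
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:  # triangle inequality
            return False
    return True


bits = (1, 2, 3)
objects = [frozenset(s) for r in range(1, 4) for s in combinations(bits, r)]

print(satisfies_metric_conditions(soergel, objects))
print(satisfies_metric_conditions(dice_distance, objects))
```

The Dice-based distance fails on the triple {1}, {2}, {1, 2}: the distance between {1} and {2} is 1, but each is only 1/3 away from {1, 2}.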

Size Dependency of the Measures
- Small molecules often have lower similarity values using Tanimoto
- Tanimoto normalizes for molecular size in the denominator: $S_{AB} = \frac{c}{a + b - c}$
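
A worked example with made-up bit counts illustrates the size effect. Suppose two pairs of molecules each differ by exactly one bit per molecule. In the small pair each fingerprint has 3 bits set with 2 in common; in the large pair each has 30 bits set with 29 in common:

$$
S_{\text{small}} = \frac{2}{3 + 3 - 2} = 0.50,
\qquad
S_{\text{large}} = \frac{29}{30 + 30 - 29} = \frac{29}{31} \approx 0.94
$$

The same absolute number of mismatches costs the small pair far more, because few bits are set to begin with.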

Other 2D Descriptor Methods
- Similarity can be based on continuous whole-molecule properties, e.g., logP, molar refractivity, topological indexes
- Usual approach is to use a distance coefficient, such as Euclidean distance
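
A sketch of a Euclidean distance over continuous descriptors follows. The descriptor values are invented for illustration, and the autoscaling step (zero mean, unit standard deviation per column) is an assumption added here, not something the slide prescribes. It is a common precaution when descriptors such as logP and molar refractivity sit on very different numerical scales.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Straight-line distance in descriptor space."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def autoscale(rows: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Scale each descriptor column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    spreads = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, spreads))
        for row in rows
    ]


# Hypothetical (logP, molar refractivity) pairs for three molecules.
descriptors = [(1.0, 20.0), (2.0, 40.0), (3.0, 60.0)]

raw_distance = euclidean(descriptors[0], descriptors[1])
scaled = autoscale(descriptors)
scaled_distance = euclidean(scaled[0], scaled[1])
print(raw_distance, scaled_distance)
```

Without scaling, the distance is dominated by the descriptor with the larger numerical range.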

Maximum Common Subgraph Similarity
- Another approach: generate alignment between the molecules (mapping)
- Define MCS: largest set of atoms and bonds in common between the two structures
- An NP-complete problem (NP = nondeterministic polynomial time): very computer intensive; in the worst case, the algorithm will have an exponential computational complexity
- Heuristics are used to cut down on the computer usage
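
To make the cost concrete, here is a naive backtracking sketch, not a production MCS algorithm. It assumes hydrogen-suppressed graphs, ignores bond orders, and finds the largest common induced subgraph, which may be disconnected. All names and the tiny example molecules are illustrative. The running time grows exponentially with molecule size, which is why practical programs rely on heuristics.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Bonds = Set[FrozenSet[int]]


def max_common_subgraph(
    atoms_a: Dict[int, str], bonds_a: Bonds,
    atoms_b: Dict[int, str], bonds_b: Bonds,
) -> List[Tuple[int, int]]:
    """Return the largest atom mapping that preserves elements and bonding."""
    order = list(atoms_a)
    best: List[Tuple[int, int]] = []

    def bonded(bonds: Bonds, i: int, j: int) -> bool:
        return frozenset((i, j)) in bonds

    def extend(pos: int, mapping: List[Tuple[int, int]], used_b: Set[int]) -> None:
        nonlocal best
        if len(mapping) > len(best):
            best = list(mapping)
        # Prune: even mapping every remaining atom could not beat the best so far.
        if len(mapping) + len(order) - pos <= len(best):
            return
        i = order[pos]
        for j in atoms_b:
            if j in used_b or atoms_a[i] != atoms_b[j]:
                continue
            # The new pair must agree with every pair already mapped.
            if all(bonded(bonds_a, i, k) == bonded(bonds_b, j, m) for k, m in mapping):
                mapping.append((i, j))
                used_b.add(j)
                extend(pos + 1, mapping, used_b)
                mapping.pop()
                used_b.remove(j)
        extend(pos + 1, mapping, used_b)  # also try leaving atom i unmapped

    extend(0, [], set())
    return best


def bonds(*pairs: Tuple[int, int]) -> Bonds:
    return {frozenset(p) for p in pairs}


# Heavy-atom graphs: ethanol C-C-O, 1-propanol C-C-C-O, dimethyl ether C-O-C.
ethanol = ({0: "C", 1: "C", 2: "O"}, bonds((0, 1), (1, 2)))
propanol = ({0: "C", 1: "C", 2: "C", 3: "O"}, bonds((0, 1), (1, 2), (2, 3)))
ether = ({0: "C", 1: "O", 2: "C"}, bonds((0, 1), (1, 2)))

print(len(max_common_subgraph(*ethanol, *propanol)))
print(len(max_common_subgraph(*ethanol, *ether)))
```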

Maximum Common Subgraph

Reduced Graph Similarity
- A structure’s key features are condensed while retaining the connections between them
- Can identify structures with similar binding characteristics, but different underlying skeletons
- Smaller number of nodes speeds up searching

3D Similarity
- Aim is often to identify structurally different molecules
- 3D methods require consideration of the conformational properties of molecules

Tanimoto Coefficient to Find Compounds Similar to Morphine

3D: Alignment-Independent Methods
- Descriptors: geometric atom pairs and their distances, valence and torsion angles, atom triplets
- Consideration of conformational flexibility greatly increases the compute time
- Relatively fewer pharmacophoric fingerprints than 2D fingerprints
- Result: low similarity values using Tanimoto
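
A sketch of a geometric atom-pair descriptor for a single conformation, assuming atoms are given as (element, coordinates) tuples and distances are grouped into 1 Å bins. The function name, bin width, and coordinates are all hypothetical. Because only interatomic distances are used, the descriptor does not change when the molecule is translated or rotated, which is what makes the method alignment-independent.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Tuple

Atom = Tuple[str, Tuple[float, float, float]]


def atom_pair_descriptor(atoms: List[Atom], bin_width: float = 1.0) -> Counter:
    """Count (element, element, distance bin) triples over all atom pairs."""
    counts: Counter = Counter()
    for (type_1, pos_1), (type_2, pos_2) in combinations(atoms, 2):
        distance = math.dist(pos_1, pos_2)
        key = (*sorted((type_1, type_2)), int(distance // bin_width))
        counts[key] += 1
    return counts


# Hypothetical coordinates in angstroms.
atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("O", (1.4, 0.0, 0.0)),
    ("N", (0.0, 2.5, 0.0)),
]
shifted = [(el, (x + 1.0, y + 1.0, z + 1.0)) for el, (x, y, z) in atoms]

descriptor = atom_pair_descriptor(atoms)
print(descriptor)
print(descriptor == atom_pair_descriptor(shifted))
```

Handling conformational flexibility would mean repeating this over many conformations, which is where the extra compute time comes from.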

Pharmacophore
- A structural abstraction of the interactions between various functional group types in a compound
- Described by a spatial representation of these groups as centers (or vertices) of geometrical polyhedra, together with pairwise distances between centers
- Source: http://www.ma.psu.edu/~csb15/pubs/searle.pdf

3D: Alignment Methods
- Require consideration of the degrees of freedom related to the conformational flexibility of the molecules
- Goal: determine the alignment where similarity measure is at a maximum

3D: Field-Based Alignment Methods
- Consideration of the electron density of the molecules
- Requires quantum mechanical calculation: costly
- Property not sufficiently discriminatory

3D: Gnomonic Projection Methods
- Molecule positioned at the center of a sphere and properties projected on the surface
- Sphere approximated by a tessellated icosahedron or dodecahedron
- Each triangular face is divided into a series of smaller triangles

Finding the Optimal Alignment
- Need a mechanism for exploring the orientational (and conformational) degrees of freedom to determine the optimal alignment where the similarity is maximized
- Methods: simplex algorithm, Monte Carlo methods, genetic algorithms
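
A deliberately crude Monte Carlo sketch of the search: random rotation angles are sampled and the best-scoring one is kept. Several simplifying assumptions are made here that a real method would not make. The molecules are rigid point sets with invented coordinates, only rotation about the z axis is explored (no translation or conformational change), and the similarity measure is a sum of pairwise Gaussian overlaps. All function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float, float]


def rotate_z(points: List[Point], angle: float) -> List[Point]:
    """Rotate points about the z axis by `angle` radians."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(cos_a * x - sin_a * y, sin_a * x + cos_a * y, z) for x, y, z in points]


def gaussian_overlap(points_a: List[Point], points_b: List[Point], alpha: float = 1.0) -> float:
    """Assumed similarity measure: sum of Gaussians of all pairwise distances."""
    return sum(
        math.exp(-alpha * math.dist(p, q) ** 2)
        for p in points_a
        for q in points_b
    )


def monte_carlo_align(
    reference: List[Point], mobile: List[Point], n_trials: int = 5000, seed: int = 0
) -> Tuple[float, float]:
    """Sample random rotations and return (best angle, best similarity)."""
    rng = random.Random(seed)
    best_angle, best_score = 0.0, gaussian_overlap(reference, mobile)
    for _ in range(n_trials):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        score = gaussian_overlap(reference, rotate_z(mobile, angle))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score


reference = [(2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, -0.5, 0.3)]
mobile = rotate_z(reference, math.radians(90))  # same shape, misaligned

start_score = gaussian_overlap(reference, mobile)
angle, score = monte_carlo_align(reference, mobile)
print(start_score, score, gaussian_overlap(reference, reference))
```

The best score found approaches the self-overlap of the reference, which is the upper bound for this measure. Simplex and genetic algorithms explore the same search space with different strategies.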

Evaluation of Similarity Methods
- Generally, 2D methods are more effective than 3D
- 2D methods may be artificially enhanced because of database characteristics (close analogs)
- Incomplete handling of conformational flexibility in 3D databases
- Best to use data fusion techniques, combining methods
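
One simple way to combine methods is to fuse their rankings. The sketch below sums each compound's rank position across the input rankings and re-sorts. Rank-sum is only one of several possible fusion rules, and the function name, compound names, and rankings are invented for illustration. It assumes every ranking contains the same compounds.

```python
from typing import Dict, List, Sequence


def fuse_by_rank_sum(*rankings: Sequence[str]) -> List[str]:
    """Combine rankings (best first) by summing each compound's rank positions."""
    totals: Dict[str, int] = {}
    for ranking in rankings:
        for position, name in enumerate(ranking, start=1):
            totals[name] = totals.get(name, 0) + position
    # Lowest total rank first; ties broken alphabetically for reproducibility.
    return sorted(totals, key=lambda name: (totals[name], name))


# Hypothetical rankings from a 2D and a 3D similarity search.
ranking_2d = ["m1", "m2", "m3", "m4"]
ranking_3d = ["m3", "m1", "m4", "m2"]

fused = fuse_by_rank_sum(ranking_2d, ranking_3d)
print(fused)
```

A compound that scores reasonably well under both methods (here m1 and m3) rises above one that only a single method favors.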

For additional information, see Dr. John Barnard’s lecture at: http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt