
1 Chemical Structure Representation and Search Systems
Lecture 6. Nov 18, 2003
John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK

2 Lecture 6: Topics to be Covered
• Similarity searching
  - similarity search vs. substructure search
  - similarity and distance metrics
  - different types of descriptor for similarity search
  - choice of descriptors
• Chemical Diversity and its measurement

3 Similarity searching
• instead of searching for all molecules containing a given substructure, we search for molecules “similar” to a given target molecule
• similar property principle: “structurally similar molecules are expected to exhibit similar properties or biological activities”
  - Mark Johnson and Gerry Maggiora (Eds.), Concepts and Applications of Molecular Similarity. Wiley, New York, 1990

4 What is similarity?
“Similarity is in the eye of the beholder”
• Similarity can be measured in many different ways
  - equivalence classes
    o can say that two molecules are similar, or that they are different
  - numerical measures
    o can say that two molecules have a similarity of, e.g. 0.85
    o similarity coefficients usually have values between 0.0 (totally different) and 1.0 (identical)
  - distance measures
    o “opposite” of similarity (0.0 = identical; may have no maximum, or may be normalised to fix a maximum limit)

5 Equivalence classes
• All molecules which are identical at some level of description are considered equivalent
  - molecular formula
  - structure graph (with no distinction between node and bond types)
  - reduced graph
  - same ring systems
  - same fingerprints

6 Equivalence classes
• two different molecules with the same graph if node and edge labels are ignored

7 Numerical similarity measures
• normally calculate some numerical measure of similarity between molecules
• query structure is a “target” molecule
• database structures can be ranked in decreasing order of similarity to target
  - find all molecules with > threshold similarity to target
  - find N most similar molecules to target
• no particular substructure is required in the retrieved molecules
  - but they will have structural features in common with target
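The two search modes on this slide (threshold search and N-most-similar search) can be sketched as follows. This is only an illustration: the toy database, the molecule names and the function names are hypothetical, fingerprints are held as Python sets of "on" bit positions, and the Tanimoto coefficient from slide 9 is assumed as the similarity measure (any coefficient could be substituted).

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints stored as sets of 'on' bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def rank_by_similarity(target, database):
    """Return (similarity, name) pairs in decreasing order of similarity to the target."""
    scored = [(tanimoto(target, fp), name) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def above_threshold(target, database, threshold):
    """All database molecules whose similarity to the target exceeds the threshold."""
    return [(s, n) for s, n in rank_by_similarity(target, database) if s > threshold]


def top_n(target, database, n):
    """The n database molecules most similar to the target."""
    return rank_by_similarity(target, database)[:n]


# Hypothetical toy database: molecule name -> set of fingerprint bits that are on
database = {
    "mol1": {1, 2, 3, 4},
    "mol2": {1, 2, 3, 9},
    "mol3": {7, 8, 9},
    "mol4": {1, 2},
}
target = {1, 2, 3, 4}

ranking = rank_by_similarity(target, database)
hits = above_threshold(target, database, 0.55)
nearest = top_n(target, database, 3)
```

Note that "mol4" is a pure subset of the target yet ranks below "mol2": similarity ranking rewards overall overlap, not the presence of any particular substructure.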

8 Similarity from fingerprints
• similarity measures are most commonly calculated from structure fingerprints
  - count the bits that are “on” in both molecules
  - count the bits that are “on” in each molecule separately

  struct A: 00010100010101000101010011110100   13 bits on (A)
  struct B: 00000000100101001001000011100000    8 bits on (B)
  A AND B:  00000000000101000001000011100000    6 bits on (C)

• similarity coefficient can be calculated from A, B and C
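As a quick check of the counts on this slide, the three quantities A, B and C can be computed directly from the two bit strings. This is a minimal sketch using plain strings; the helper names are made up for the illustration, and real systems would normally use packed bit vectors.

```python
# The two example fingerprints from the slide, as strings of '0'/'1'
struct_a = "00010100010101000101010011110100"
struct_b = "00000000100101001001000011100000"


def count_on(bits):
    """Number of bits set to 1."""
    return bits.count("1")


def bitwise_and(x, y):
    """Bit string with a 1 wherever both inputs have a 1."""
    return "".join("1" if p == "1" and q == "1" else "0" for p, q in zip(x, y))


a = count_on(struct_a)                   # bits on in A
b = count_on(struct_b)                   # bits on in B
common = bitwise_and(struct_a, struct_b)
c = count_on(common)                     # bits on in both

print(a, b, c)  # 13 8 6
```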

9 Tanimoto coefficient
• $\text{similarity} = \frac{C}{A + B - C} = \frac{6}{13 + 8 - 6} = 0.4$
• the number of bits set in both molecules divided by the number of bits set in either molecule
• The Tanimoto coefficient is the most commonly used similarity coefficient in chemical informatics
  - also called the Jaccard coefficient

10 Dice coefficient
• $\text{similarity} = \frac{2C}{A + B} = \frac{12}{13 + 8} = 0.57$
• does not give the same values as the Tanimoto coefficient, but will rank molecules in the same order of similarity to a target
  - i.e. “monotonic” with the Tanimoto coefficient
• also called the Czekanowski or Sørensen coefficient

11 Cosine coefficient
• $\text{similarity} = \frac{C}{\sqrt{A \times B}} = \frac{6}{\sqrt{13 \times 8}} = 0.588$
• not monotonic with the Tanimoto and Dice coefficients, but highly correlated with them
• also called the Ochiai coefficient
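The three coefficients introduced so far can be written as small functions of the counts A, B and C. The sketch below reproduces the worked values from slides 9 to 11; the function names are illustrative only. It also checks the relation Dice = 2T / (1 + T), which follows algebraically from the two definitions and is the reason Dice and Tanimoto always rank molecules in the same order.

```python
import math


def tanimoto(a, b, c):
    """Bits in common divided by bits set in either molecule."""
    return c / (a + b - c)


def dice(a, b, c):
    """Twice the bits in common divided by the total bits set in both molecules."""
    return 2 * c / (a + b)


def cosine(a, b, c):
    """Bits in common divided by the geometric mean of the two bit counts."""
    return c / math.sqrt(a * b)


a, b, c = 13, 8, 6  # counts from the slide 8 example

t = tanimoto(a, b, c)
d = dice(a, b, c)
s = cosine(a, b, c)

print(round(t, 3), round(d, 3), round(s, 3))  # 0.4 0.571 0.588
```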

12 What is similarity?
• The three coefficients discussed so far all ignore bits that are off in both molecules
  - is common absence of features evidence of similarity between them?
  - are a camel, a horse and a nematode similar because they all lack wings?
  - are a bat, a heron and a dragonfly similar because they all have wings?

13 What is similarity?
• Are these molecules similar because they all lack heteroatoms?

14 Simple Matching coefficient
• a similarity coefficient that takes into account the fingerprint bits that are off in both molecules (D)
• $\text{similarity} = \frac{C + D}{N} = \frac{6 + 17}{32} = 0.719$
• N is the length of the fingerprint
  - $N = A + B - C + D$
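The value of D is not counted directly on the slide; it follows from N = A + B – C + D. A short sketch of that arithmetic, with an illustrative function name:

```python
def simple_matching(a, b, c, n):
    """Simple Matching coefficient from bit counts and fingerprint length n."""
    d = n - (a + b - c)  # bits that are off in both molecules
    return (c + d) / n


a, b, c, n = 13, 8, 6, 32  # slide 8 example, 32-bit fingerprints
d = n - (a + b - c)
sm = simple_matching(a, b, c, n)

print(d, sm)  # 17 0.71875
```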

15 Asymmetric Similarity
• Usually we think that if A is similar to B, then B is similar to A
• some coefficients have been defined in which this is not true
  - $S(A,B) \neq S(B,A)$
  - e.g. Tversky similarity
    $\text{Similarity} = \frac{C}{\alpha (A - C) + \beta (B - C) + C}$
    where $\alpha$ and $\beta$ are user-defined parameters

16 Asymmetric (Tversky) Similarity
$T = \frac{C}{\alpha (A - C) + \beta (B - C) + C}$
• if $\alpha = \beta = 1$, equation reduces to Tanimoto coefficient
• if $\alpha = \beta = \tfrac{1}{2}$, equation reduces to Dice coefficient
• if $\alpha \neq \beta$, T becomes asymmetric
  - where $\alpha = 1$ and $\beta = 0$, $T = C / A$, i.e. the fraction of A which it has in common with B
    o when T = 1.0, it indicates that A is a substructure of B (at the level of fingerprint matching)
    o when T ≈ 1.0, it indicates that A is almost a substructure of B
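The special cases listed on this slide can be verified numerically with the slide 8 counts. A minimal sketch follows; the function name and argument order are assumptions made for the illustration.

```python
def tversky(a, b, c, alpha, beta):
    """Tversky similarity from bit counts a, b (each molecule) and c (in common)."""
    return c / (alpha * (a - c) + beta * (b - c) + c)


a, b, c = 13, 8, 6

as_tanimoto = tversky(a, b, c, 1.0, 1.0)   # equals C / (A + B - C)
as_dice = tversky(a, b, c, 0.5, 0.5)       # equals 2C / (A + B)
fraction_of_a = tversky(a, b, c, 1.0, 0.0) # equals C / A
fraction_of_b = tversky(a, b, c, 0.0, 1.0) # equals C / B, so the measure is asymmetric

print(as_tanimoto, round(as_dice, 3), round(fraction_of_a, 3), fraction_of_b)
```

With alpha = 1 and beta = 0 a value of 1.0 would mean every bit of A is also set in B, which is the fingerprint-level screen for "A is a substructure of B".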

17 Subsimilarity search
• This provides a means of substructure similarity search
  - also possible with maximal common subgraphs
  - A and B could be the number of atoms in each molecule, and C could be the number of atoms in their maximal common substructure (MCS)
• fingerprint-based similarity is generally faster than identifying the MCS
  - but common features (fragments) will be smaller

18 Similarity and Distance
• Distance is the opposite of similarity
• A similarity coefficient in the range 0 to 1 can be converted to a distance by taking its “complement”
  - Distance = 1 – Similarity
• Sometimes there is a different name for the complement of a similarity coefficient:
  - 1 – Tanimoto Coefficient = Soergel Distance
  - 1 – Simple Matching Coefficient = Normalised Hamming Distance
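The two named complements can be checked with the counts from slide 8. In this sketch the Hamming distance is taken as the number of bit positions where the two fingerprints differ, i.e. A + B – 2C; the function names are illustrative.

```python
def soergel_distance(a, b, c):
    """Complement of the Tanimoto coefficient."""
    return 1 - c / (a + b - c)


def hamming_distance(a, b, c):
    """Number of bit positions in which the two fingerprints differ."""
    return a + b - 2 * c


def normalised_hamming_distance(a, b, c, n):
    """Hamming distance divided by the fingerprint length n."""
    return hamming_distance(a, b, c) / n


a, b, c, n = 13, 8, 6, 32
simple_matching = (c + (n - (a + b - c))) / n

soergel = soergel_distance(a, b, c)
norm_hamming = normalised_hamming_distance(a, b, c, n)

print(soergel, norm_hamming, 1 - simple_matching)
```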

19 Distance Coefficients
• analogous to distances in multi-dimensional geometric space
  - not necessarily equivalent to such distances
• some distance coefficients are called distance metrics
  - to be a metric, a distance coefficient has to obey certain rules

20 Distance metrics
• distances must be zero or positive (no upper limit)
• distance from object to itself must be zero
• distance between non-identical objects must be greater than zero
• distances must be symmetric
• distances must obey the triangular inequality
  - $D_{A,B} \le D_{A,C} + D_{B,C}$

21 Properties of distance coefficients
• not all distance coefficients are metrics
• those that are have certain advantages, and some assumptions can be made about their behaviour
  - e.g. plotting in multi-dimensional space
• metric distance coefficients include
  - Hamming Distance
  - Soergel Distance (= 1 – Tanimoto similarity)
• non-metric distance coefficients include
  - 1 – Dice coefficient
  - 1 – Cosine coefficient
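A small counterexample shows why 1 – Dice is not a metric. The three toy fingerprints below are hypothetical, chosen so that the triangular inequality fails for the Dice complement while holding for the Soergel distance. A single passing example does not prove that Soergel is a metric; it only illustrates the contrast.

```python
def soergel_distance(x, y):
    """1 - Tanimoto, for fingerprints stored as non-empty sets of 'on' bits."""
    return 1 - len(x & y) / len(x | y)


def dice_distance(x, y):
    """1 - Dice, for fingerprints stored as non-empty sets of 'on' bits."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))


def obeys_triangle_inequality(distance, p, q, r, tol=1e-12):
    """Check D(x,y) <= D(x,z) + D(z,y) for each choice of the 'middle' object z."""
    triples = [(p, q, r), (p, r, q), (q, r, p)]
    return all(
        distance(x, y) <= distance(x, z) + distance(z, y) + tol
        for x, y, z in triples
    )


# Hypothetical fingerprints: p and q share nothing, r contains both
fp_p, fp_q, fp_r = {1}, {2}, {1, 2}

soergel_ok = obeys_triangle_inequality(soergel_distance, fp_p, fp_q, fp_r)
dice_ok = obeys_triangle_inequality(dice_distance, fp_p, fp_q, fp_r)

print(soergel_ok, dice_ok)  # True False
```

For the Dice complement the direct distance from p to q is 1, but going via r costs only 1/3 + 1/3, so the "detour" is shorter than the direct route.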

22 Continuous variables
• Similarities and distances can be defined when the descriptors used are continuous variables instead of “on/off” fingerprint bits (dichotomous variables)
  - these might be a set of property values for a molecule
    o molecular weight
    o number of rotatable bonds
    o number of potential hydrogen-bond donors/acceptors
    o solubility
    o acid dissociation constant
    o etc.
• because these properties have different ranges, they may need to be “normalised” to range 0–1
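One simple way to normalise descriptors to the range 0–1 is min-max scaling of each descriptor over the dataset. The slide does not say which normalisation is intended, so this choice is an assumption, and the property values below are invented for the illustration.

```python
def normalise_columns(rows):
    """Min-max scale each descriptor (column) to the range 0-1 across all molecules."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Hypothetical molecules: [molecular weight, rotatable bonds, H-bond donors]
raw = [
    [180.0, 3, 1],
    [300.0, 5, 2],
    [420.0, 9, 5],
]

scaled = normalise_columns(raw)
```

Without this step the molecular weight, with its much larger numerical range, would dominate any distance calculated from the three descriptors.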

23 Continuous variables
• most similarity coefficients are also defined for continuous variables
  - dichotomous variables are a special case when all values are either 0 or 1
• but metric properties may not be the same for continuous variables
  - e.g. 1 – Tanimoto is not metric where continuous variables are used

24 Euclidean distance
• “ordinary distance”
• each of n variables (descriptors) is a dimension in n-dimensional space
• distance between A and B is given by
  $D_{A,B} = \sqrt{\sum_{j=1}^{n} (x_{jA} - x_{jB})^2}$
  - $x_{jA}$ = value of descriptor j in molecule A
  - $x_{jB}$ = value of descriptor j in molecule B
• for dichotomous variables, $(\text{Euclidean distance})^2$ = Hamming Distance
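The last point on this slide can be confirmed with the two fingerprints from slide 8, treating each bit as a 0/1 descriptor value. A minimal sketch with illustrative function names:

```python
import math


def euclidean_distance(x_a, x_b):
    """Square root of the summed squared differences over all descriptors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x_a, x_b)))


def hamming_distance(x_a, x_b):
    """Number of positions at which the two vectors differ."""
    return sum(p != q for p, q in zip(x_a, x_b))


# Slide 8 fingerprints converted to lists of 0/1 values
bits_a = [int(ch) for ch in "00010100010101000101010011110100"]
bits_b = [int(ch) for ch in "00000000100101001001000011100000"]

euclid = euclidean_distance(bits_a, bits_b)
hamming = hamming_distance(bits_a, bits_b)

print(euclid, hamming)  # 3.0 9
```

Each differing bit contributes (1 – 0)² = 1 to the sum under the square root, which is why the squared Euclidean distance equals the Hamming distance for binary data.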

25 Correlated variables
• Sometimes different variables may be highly correlated with each other
• statistical techniques can be used to identify such correlations, and reduce the number of variables
  - e.g. Principal Components Analysis (PCA)
    o combines correlated descriptors into “components”
    o each component “explains” a certain proportion of the variance in the dataset
• just the first few principal components are used to calculate similarities
  - also easier to visualise plots in 3 dimensions!

26 Descriptor types used
• Many different fragment types have been used for generating fingerprints for use in similarity searching
  - atom sequence
    o linear path of atoms and bonds through molecule
    o may generate only paths of certain lengths
  - augmented atom
    o atom and its immediate neighbours

27 Fragment types
  - ring composition
    o atom/bond sequence around a ring
    o question of which rings to choose
  - ring fusion patterns
    o sequence of ring connectivities around a ring
      for each atom, specify the number of ring bonds it has

28 Fragment types
  - atom pairs
    o pair of atoms in same molecule, with number of bonds in shortest path between them
    o additional differentiation between atom types
      number of attached hydrogens / pi-bonds
  - topological torsions
    o connected sequence of 4 atom types
    o atom types as described for atom pairs
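Atom-pair generation can be sketched with a breadth-first search over the molecular graph. This is a simplified illustration: atoms are typed by element symbol only (the extra differentiation by attached hydrogens and pi-bonds mentioned above is omitted), hydrogens are left out, and the input format (a list of element symbols plus a list of bonded index pairs) is an assumption made for the example.

```python
from collections import Counter, deque


def shortest_path_lengths(adjacency, start):
    """Breadth-first search: number of bonds from start to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def atom_pairs(elements, bonds):
    """Count (type, bonds apart, type) descriptors for every pair of atoms."""
    adjacency = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    pairs = Counter()
    for i in range(len(elements)):
        dist = shortest_path_lengths(adjacency, i)
        for j in range(i + 1, len(elements)):
            if j in dist:
                # Sort the two types so C...O and O...C give the same descriptor
                first, second = sorted((elements[i], elements[j]))
                pairs[(first, dist[j], second)] += 1
    return pairs


# Ethanol heavy atoms: C(0)-C(1)-O(2)
ethanol = atom_pairs(["C", "C", "O"], [(0, 1), (1, 2)])
```

Each distinct descriptor would then be mapped to a fingerprint bit, for example through a fragment dictionary or a hash function.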

29 Generalised fragments
• sometimes specific fragments (with detailed description of atom and bond types) are too specific to be of much use in fingerprints
  - very low frequency → very sparse fingerprints
• atom and bond types can be generalised
  - any ring bond
  - any halogen (F, Cl, Br, I)
  - any chalcogen (O, S, Se, Te)
• this gives fragments with higher frequency

30 3D fragments
• fragments can be used to describe the 3D structure of a molecule too
  - usually involve interatomic distances and/or bond angles
  - because distance values are continuous variables, they are “binned”
    o each bin represents a range of distances
    o e.g. distance of 3.000 – 3.999 Å
  - each bin corresponds to a fingerprint bit
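Binning can be sketched as integer division of the distance by the bin width. The 1 Å width matches the example range on the slide; the total of 10 bins, the clamping of long distances into the last bin and the example distances are assumptions made for the illustration.

```python
def distance_bin(distance, width=1.0, n_bins=10):
    """Index of the bin holding the distance; bin k covers [k*width, (k+1)*width)."""
    return min(int(distance // width), n_bins - 1)  # overflow goes into the last bin


def distance_fingerprint(distances, width=1.0, n_bins=10):
    """Set one fingerprint bit for every bin that contains at least one distance."""
    bits = [0] * n_bins
    for d in distances:
        bits[distance_bin(d, width, n_bins)] = 1
    return bits


# Hypothetical interatomic distances in angstroms
fingerprint = distance_fingerprint([3.0, 3.999, 4.2, 7.5])

print(fingerprint)  # [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
```

Distances of 3.999 Å and 4.0 Å fall into different bins although they are practically identical, which is one source of the boundary problems that binned 3D descriptors suffer from.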

31 3D fragments
• a popular 3D descriptor is the 3-point pharmacophore
• molecule is analysed to identify “pharmacophoric points”
  - points in molecule likely to be involved in binding to a receptor site
    o positive charges
    o negative charges
    o hydrogen-bond donors (e.g. –OH, –NH₂)
    o hydrogen-bond acceptors (e.g. =O)
    o aromatic groups
    o hydrophobic groups
• pharmacophoric points do not necessarily coincide with the positions of individual atoms

32 3D fragments
• each fragment consists of 3 pharmacophoric points
  - the distances between each pair of these points are binned and used to set fingerprint bits
• 4-point pharmacophore fragments are also used
• Different people have used slightly different definitions of pharmacophoric points

33 3D fragments
• an issue with 3D fragments is conformational flexibility
  - a molecule with a single configuration may adopt a number of different conformations, by rotation about single bonds
  - some conformations may be energetically more favourable than others
  - some programs for calculation of 3D coordinates from 2D (topological) representations can generate several different conformers
  - different pharmacophoric fragments may result
  - these can be “overcoded” (ORed together in the fingerprint) but this can cause problems

34 Choice of descriptors
“similarity is in the eye of the beholder”
• obviously the similarity values obtained will depend heavily on the set of descriptors used
  - choose ring-based fragments and the most similar structures will be those with similar ring systems, irrespective of functional groups attached to them
  - choose small (functional group-like) fragments and the most similar structures will be those with the same functional groups, irrespective of the ring systems they are attached to

35 Choice of descriptors
• A danger with fragment-based descriptors is redundancy of fragments
  - different fragments may be representing the same chemical features
    o this will set more bits in the fingerprints when those features are present in a molecule
    o and this may give the molecules a higher similarity than is warranted
  - hashed fingerprints may result in molecules with no features in common appearing to be similar because different fragments collide on the same bit position
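The collision problem with hashed fingerprints can be made concrete with a toy example. The hash function below (sum of character codes modulo the fingerprint length) is deliberately weak and is not one used by any real fingerprinting software; it was chosen only so the collision is easy to follow by hand. The fragment strings are hypothetical linear atom paths.

```python
def toy_hash(fragment, n_bits):
    """Deliberately weak hash: sum of character codes, folded to the fingerprint length."""
    return sum(ord(ch) for ch in fragment) % n_bits


def hashed_fingerprint(fragments, n_bits=64):
    """Set of bit positions switched on by the given fragments."""
    return {toy_hash(fragment, n_bits) for fragment in fragments}


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two non-empty sets of 'on' bit positions."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Two hypothetical molecules with no fragment in common
fragments_x = {"CCO", "CCN"}
fragments_y = {"COC", "CNC"}

fp_x = hashed_fingerprint(fragments_x)
fp_y = hashed_fingerprint(fragments_y)

shared_fragments = fragments_x & fragments_y
similarity = tanimoto(fp_x, fp_y)

print(shared_fragments, similarity)  # set() 1.0
```

Real hash functions spread fragments far more evenly, but with a fixed number of bits some collisions are unavoidable once the number of distinct fragments exceeds the fingerprint length.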

36 What makes a good descriptor?
• how do we decide which descriptors and similarity or distance measures are “best”?
• go back to “similar property” principle: “structurally similar molecules are expected to exhibit similar properties or biological activities”
• we can do some experiments using various different sets of descriptors
  - dataset of compounds with known biological activity or measured physico-chemical property value

37 Evaluation of descriptor sets and similarity measures
  - for each compound in the dataset “predict” its property value to be the same as those with which it is most similar
    o usually take the average of 3 or 5 nearest neighbours (“k-nearest neighbours prediction”)
  - calculate the correlation coefficient between the observed and predicted property values
• on this basis most similarity coefficients perform about the same
  - for no very good reason, the Tanimoto measure has become the most popular
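The evaluation procedure described on this slide can be sketched as a leave-one-out loop followed by a correlation calculation. Everything concrete here is an assumption for illustration: the eight toy fingerprints and property values are invented, Tanimoto is used as the similarity measure, k = 3, and the Pearson coefficient is taken as "the correlation coefficient".

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two non-empty sets of 'on' bit positions."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def knn_predict(index, fingerprints, values, k=3):
    """Predict one compound's value as the mean over its k most similar neighbours."""
    target = fingerprints[index]
    others = [
        (tanimoto(target, fp), values[i])
        for i, fp in enumerate(fingerprints)
        if i != index  # leave the compound itself out
    ]
    others.sort(key=lambda pair: pair[0], reverse=True)
    nearest = others[:k]
    return sum(value for _, value in nearest) / len(nearest)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical dataset: two structural families with different property levels
fingerprints = [
    {1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {1, 2, 5},
    {7, 8, 9}, {7, 8, 10}, {7, 8, 9, 10}, {7, 8, 11},
]
observed = [1.0, 1.2, 1.1, 0.9, 5.0, 5.4, 5.2, 4.8]

predicted = [knn_predict(i, fingerprints, observed) for i in range(len(observed))]
r = pearson(observed, predicted)
```

The same loop would be repeated for each descriptor set or similarity coefficient under test, and the resulting correlation coefficients compared.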

38 Evaluation of descriptor sets
• results on different descriptor sets are more ambiguous
  - likely to be heavily influenced by the property to be predicted
  - most pharmaceutical companies are interested in predicting which compounds will bind well to a protein receptor site
• different types of descriptors each have their own advocates (usually their inventors)
• main argument is between “2D” and “3D”

39 2D vs 3D descriptors
• advocates of 3D descriptors argue that in binding to a protein receptor it is the 3D arrangement of the molecule that is important
• experimental work done at Abbott Laboratories in the mid-1990s found that fingerprints based on 2D fragments performed consistently better
  - best of all were the fingerprints used in MDL’s ISIS software (“MACCS keys” or “ISIS keys”)
    o these are based on simple functional groups and rings
    o one version has only 166 bits, with substantial redundancy

40 2D vs 3D descriptors
• problem may be that 3D descriptors we have (3-point pharmacophores with binned distances) are not good enough
  - there may be “spurious accuracy” in the detailed distances involved
  - conformational flexibility may be causing problems (as one distance gets larger, another gets smaller)
  - molecule may change conformation during binding
  - some improved success has been found by identifying “projection points” for hydrogen bond donors/acceptors (i.e. where they’re pointing to, not where they are)

41 2D vs 3D descriptors
• 2D descriptors provide “bounds” on possible 3D conformations
  - “2½D” descriptors (including some stereochemical information) may be useful
• “Superiority” of 2D descriptors in some studies may be an artifact of the datasets used
  - datasets may have large numbers of close analogues
  - these will have high 2D similarity, as well as correlated activity

42 Field-based 3D similarity
• Another approach to similarity studies
• Based on overlap of continuously-varying fields (e.g. electron density)
• Computationally much more intensive
  - Calculation of 3D structures and fields
  - Alignment of molecules to measure overlap
  - Use of grid points in overlap calculations

43 Chemical Diversity
• important feature of compound collections and combinatorial libraries
• idea is to cover as much of “chemical space” as possible
  - lots of different structural features represented
• avoid having too many similar compounds
  - no point in testing different compounds likely to have the same properties

44 Chemical diversity measures
• numerical measure of the diversity of a set of compounds
  - several different measures in use
  - no real agreement on the best measure
  - frequently based on similarity/distance between pairs of compounds
    o average distance between every pair of compounds
      algorithms are available to calculate this in O(N) time
      short distances can be compensated by long distances
    o minimum distance between any pair of compounds
      requires all pairwise distances to be calculated (O(N²))
    o many other more complex measures
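Both pairwise measures can be sketched by brute force over all pairs of compounds. The sketch assumes the Soergel distance on set-based fingerprints and invents a three-compound toy set. Note that it computes the average the straightforward O(N²) way; the O(N) algorithms mentioned on the slide are not reproduced here.

```python
from itertools import combinations


def soergel_distance(fp_a, fp_b):
    """1 - Tanimoto, for fingerprints stored as non-empty sets of 'on' bits."""
    return 1 - len(fp_a & fp_b) / len(fp_a | fp_b)


def pairwise_distances(fingerprints):
    """Distances for every unordered pair of compounds: N(N-1)/2 values."""
    return [soergel_distance(x, y) for x, y in combinations(fingerprints, 2)]


def mean_pairwise_distance(fingerprints):
    """Average distance over all pairs; short and long distances offset each other."""
    distances = pairwise_distances(fingerprints)
    return sum(distances) / len(distances)


def min_pairwise_distance(fingerprints):
    """Distance between the two closest compounds in the set."""
    return min(pairwise_distances(fingerprints))


# Hypothetical set: two close analogues and one unrelated compound
compound_set = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]

mean_diversity = mean_pairwise_distance(compound_set)
min_diversity = min_pairwise_distance(compound_set)

print(round(mean_diversity, 3), min_diversity)  # 0.833 0.5
```

The example illustrates the compensation effect: the mean looks high because of the two long distances, while the minimum exposes the pair of close analogues.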

45 Cell-based diversity
• Usually used with descriptors based on continuous variables
  - “chemical space” is divided into a “grid”
    o one dimension for each descriptor
    o one grid square (“cell”) for each range of descriptor values
  - compounds are placed in appropriate cell (hypercube) for their descriptor values
  - diversity can be measured by counting occupied/unoccupied cells, and calculating average occupancy of each cell
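A cell-based measure can be sketched by mapping each compound's descriptor values to a tuple of bin indices and counting distinct tuples. The sketch assumes descriptors already normalised to the range 0–1, equal-width bins, and an invented two-descriptor dataset; the average occupancy is taken over occupied cells only, which is one of several possible readings of the slide.

```python
from collections import Counter


def cell_of(descriptor_values, bins_per_dimension):
    """Grid cell (tuple of bin indices) for descriptors normalised to the range 0-1."""
    return tuple(
        min(int(value * bins_per_dimension), bins_per_dimension - 1)  # 1.0 -> last bin
        for value in descriptor_values
    )


def cell_occupancy(compounds, bins_per_dimension):
    """Number of compounds falling into each occupied cell."""
    return Counter(cell_of(values, bins_per_dimension) for values in compounds)


# Hypothetical compounds described by two normalised descriptors
compounds = [(0.10, 0.10), (0.15, 0.20), (0.90, 0.90), (0.50, 0.60)]
bins = 4

occupancy = cell_occupancy(compounds, bins)
total_cells = bins ** len(compounds[0])
occupied_cells = len(occupancy)
average_occupancy = len(compounds) / occupied_cells

print(occupied_cells, total_cells)  # 3 16
```

The number of cells grows as bins to the power of the number of descriptors, which is why this approach is normally restricted to a handful of dimensions.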

46 Diverse subset selection
• Frequently similarity measures (and other criteria) are used to identify, from a large number of compounds that could be synthesised for testing, a suitably diverse subset that will be synthesised and tested
  - this is particularly important with combinatorial libraries
    o first design a large “virtual” library
    o then identify a subset “real” library to synthesise

47 Virtual library subsetting
acid + amine → amide
• virtual library has 1000 possible acids and 1000 possible amines, giving 1M amides
• we only want to make and test 900 amides
  - we need 30 acids and 30 amines
• we select diverse subsets (e.g. to maximise the minimum distance between any pair of compounds) of the acids and the amines
  - may be better off identifying the most diverse 900-compound subset of the amides
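One common heuristic for picking a subset that keeps the minimum pairwise distance large is greedy "MaxMin" selection: repeatedly add the candidate whose nearest already-selected compound is furthest away. The slide does not name a specific algorithm, so this is a swapped-in technique shown only as a sketch, on an invented one-descriptor dataset with the first item as an arbitrary starting point.

```python
def maxmin_select(items, distance, k):
    """Greedy MaxMin: grow the subset by the item furthest from its nearest selected item."""
    selected = [0]  # arbitrary starting compound (an assumption of this sketch)
    while len(selected) < k:
        candidates = [i for i in range(len(items)) if i not in selected]
        best = max(
            candidates,
            key=lambda i: min(distance(items[i], items[j]) for j in selected),
        )
        selected.append(best)
    return selected


# Library sizes from the slide
n_virtual_amides = 1000 * 1000  # every acid combined with every amine
n_real_amides = 30 * 30         # 30 selected acids x 30 selected amines

# Hypothetical reagents described by a single normalised descriptor
reagent_descriptors = [0.0, 0.1, 0.5, 0.9, 1.0]
chosen = maxmin_select(reagent_descriptors, lambda x, y: abs(x - y), 3)

print(n_virtual_amides, n_real_amides, chosen)  # 1000000 900 [0, 4, 2]
```

Selecting 30 acids and 30 amines separately guarantees a fully combinatorial 30 × 30 array, whereas picking the 900 most diverse amides directly may give better coverage but generally needs more than 30 of each reagent.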

48 Virtual library subsetting
• Number of possible subsets of 900 from 1 million is vast
  - far too many to try them all
• “Genetic algorithms” often used
  - “chromosomes” represent possible subsets
  - they are “mutated” etc. at each “generation” to try to get better subsets
  - “fitness” functions measure diversity and other characteristics (cost, “drug-likeness” etc.)
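The scale of the problem and the general shape of a genetic algorithm can be sketched as below. This is a toy under stated assumptions, not any published subset-selection program: compounds are points on a single invented descriptor axis, fitness is the minimum pairwise distance only (no cost or drug-likeness terms), and the crossover, mutation and survivor rules are simple choices made for the illustration.

```python
import math
import random
from itertools import combinations

# Number of distinct 900-compound subsets of a 1M-compound virtual library
n_subsets = math.comb(1_000_000, 900)  # an integer with more than 3000 decimal digits


def fitness(subset, points):
    """Diversity score of a chromosome: minimum distance between any pair in the subset."""
    return min(abs(points[i] - points[j]) for i, j in combinations(subset, 2))


def crossover(parent_a, parent_b, k, rng):
    """Child draws its k members from the pooled members of both parents."""
    return frozenset(rng.sample(sorted(parent_a | parent_b), k))


def mutate(subset, n_items, rng):
    """Swap one member of the subset for a compound not currently in it."""
    child = set(subset)
    child.remove(rng.choice(sorted(child)))
    child.add(rng.choice([i for i in range(n_items) if i not in child]))
    return frozenset(child)


def evolve(points, k, population_size=20, generations=50, seed=0):
    """Return the best subset found and the best fitness seen at each generation."""
    rng = random.Random(seed)
    n_items = len(points)
    population = [frozenset(rng.sample(range(n_items), k)) for _ in range(population_size)]
    history = [max(fitness(s, points) for s in population)]
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, points), reverse=True)
        parents = population[: population_size // 2]  # fitter half survives unchanged
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, k, rng), n_items, rng))
        population = parents + children
        history.append(max(fitness(s, points) for s in population))
    return max(population, key=lambda s: fitness(s, points)), history


# Hypothetical library: 60 compounds with distinct values of one normalised descriptor
points = [((i * 37) % 101) / 100 for i in range(60)]
best_subset, history = evolve(points, k=5)
```

Because the fitter half of each generation is carried over unchanged, the best fitness can never decrease from one generation to the next, but there is no guarantee that the globally best subset is found.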

49 Summary of Lecture 6
• similarity searching is an important alternative to substructure search
• there are many different ways of measuring the similarity or distance between molecules
• similarity can be measured with respect to many different types of structure descriptor
  - there is no general agreement on the “best” descriptors
• similarity and distance measures can be used to identify compounds to be synthesised and tested as potential new drugs
• similarity is the basis for measuring chemical diversity
  - useful concept in identifying subsets of a large dataset which cover as much as possible of chemical space

Barnard, J. M.; Downs, G. M.; Willett, P. J. Chem. Inf. Comput. Sci., 1998, 38, 983-996
Leach and Gillet (2003) Chapters 3, 4, 5 and 6

50 Lecture 7: Topics to be Covered
• Clustering
  - identifying classes of molecules similar to each other, but different to those in other classes
• Topological indexes
  - numbers that can be calculated from connection tables
• Property prediction
  - predicting physicochemical or biological properties directly from connection tables
• The Drug Discovery Process
