Optimized Virtual Screening

Slide 1 Optimized Virtual Screening
Miklós Vargyas, Zsuzsanna Szabó, György Pirok, Ferenc Csizmadia, Matthias Steger, Modest von Korff
ChemAxon Ltd.; AXOVAN AG, Allschwil, Switzerland (Axovan is now Actelion.)

Drug research Is it searching for a needle in a haystack?
Slide 2 Drug research Is it searching for a needle in a haystack?
Figure labels: corporate database; structures found.
Drug research is often described as searching for a needle in a haystack. Large compound libraries, corporate databases, virtual compound collections and suppliers' databases are routinely screened for drug candidates. The situation is worse than finding a needle in a haystack, because we do not know the exact shape of the needle. We have very limited knowledge about the "thing" we need to find: often we do not know what it looks like, or even whether it is there. And, of course, there may well be more than one.

Drug research Find something similar to a fistful of needles
Slide 3 Drug research Find something similar to a fistful of needles
Figure labels: query structures (known actives); corporate database (targets); structures found (virtual hits).
In virtual screening the target database is searched for structures that bear some similarity to a small set of query structures. Typically, the query molecules exhibit similar biological or pharmacological activity; for instance, they bind to the same protein to form a receptor-ligand complex. The expectation in virtual screening is that database structures found to be similar to the query structures are functional analogs and will therefore have similar activity. (Note that this talk focuses on ligand-based drug research, where no direct information on the structure of the target protein is available. Techniques such as 3D docking lie outside the scope of this work.)

Molecular similarity What is it?
Slide 4 Molecular similarity What is it?
Chemical, pharmacological or biological properties of two compounds match. The more features two molecules have in common, the higher their similarity.
Figure panels: Chemical; Pharmacophore.
In the simplest case the question of molecular similarity is raised for two molecules. They are regarded as similar if their chemical/topological, pharmacological or biological properties match. The two structures on top are chemically similar to each other. This is reflected in their common sub-graph, or scaffold: they share 14 atoms. The two structures at the bottom are less similar chemically (topologically), yet they have the same pharmacological activity: both are Angiotensin-Converting Enzyme (ACE) inhibitors.

Molecular similarity How to calculate it?
Slide 5 Molecular similarity How to calculate it?
Quantitative assessment of the similarity/dissimilarity of structures needs a numerically tractable form:
molecular descriptors, fingerprints, structural keys
sequences/vectors of bits or numeric values that can be compared by distance functions or similarity metrics
Assessing the similarity or dissimilarity of two compounds is typically not done at the structural level (though this is possible, for instance by the size of the maximum common sub-graph). Instead, structures are encoded into (or represented by) a set of values that is numerically tractable. A wide variety of such encodings is available: molecular descriptors, molecular fingerprints and structural keys are all well-known approaches. Each can be represented as a multidimensional vector, and the dissimilarity of two structures can be expressed as the distance between their vectors. The Euclidean distance is the most widely used dissimilarity function of this type. Another family of proximity measures regards the set of values as a series and compares the values at the same position in the two series, relating the number of matching values to the number of differing ones. A notable example is the Tanimoto coefficient.
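The two proximity measures named in the notes can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: the function names are made up here, and the Tanimoto coefficient is shown in its usual binary form (bits set in both fingerprints divided by bits set in either).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tanimoto(a, b):
    """Binary Tanimoto coefficient: common bits / bits set in either vector."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two empty fingerprints are treated as identical (an assumption).
    return both / either if either else 1.0
```

Euclidean distance is a dissimilarity (0 means identical), whereas the Tanimoto coefficient is a similarity (1 means identical); 1 - Tanimoto is the matching dissimilarity.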

Molecular descriptors
Slide 6 Molecular descriptors Example 1: chemical fingerprint
hashed binary fingerprint
encodes topological properties of the chemical graph: connectivity, edge label (bond type), node label (atom type)
allows the comparison of two molecules with respect to their chemical structure
Construction
find all 0, 1, …, n step walks in the chemical graph
generate a bit array for each walk, with a given number of bits set
merge the bit arrays with a logical OR operation
Topological chemical fingerprints encode structural properties of the chemical graph as a sequence of bits. The encoding (hashing) is not reversible, so two different structures can have the same fingerprint (which makes the name "fingerprint" somewhat inadequate). If a certain feature is present in the structure, for instance a C-O-H pattern, then specific bits in the series are always set; however, the very same bits can also be set by many other structural features. These properties make the hashed fingerprint suitable for structural comparison, substructure search and similarity search. Various encoding schemes are available; one takes all walks up to a given maximum length in the chemical graph as the patterns to be encoded. For each such pattern certain bits are 'turned on' in the fingerprint. Once a bit is turned on by a feature it remains 1; other features cannot cancel it out.

Slide 7 Molecular descriptors Example 1: chemical fingerprint
Example: CH3-CH2-OH, walks from the first carbon atom (the bit arrays shown on the slide are not preserved in the transcript)
length   walks
0        C
1        C-H, C-C
2        C-C-H, C-C-O
3        C-C-O-H
The bit arrays of all walks from the first carbon atom are then merged.
This example illustrates how a 10-bit topological chemical fingerprint is created for a simple chain structure. All walks of up to 3 steps are considered, and 2 bits are set for each pattern.
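The construction steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the fingerprint generator used by the authors: the names `walks`, `pattern_bits` and `hashed_fingerprint` are hypothetical, bond types are ignored, walks are restricted to simple paths (no atom is revisited), and SHA-256 stands in for whatever hash function the real implementation uses.

```python
import hashlib


def walks(atoms, bonds, max_steps):
    """Yield a label string for every simple path of 0..max_steps bonds."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def extend(path):
        yield "-".join(atoms[i] for i in path)
        if len(path) - 1 < max_steps:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    yield from extend(path + [nxt])

    for start in range(len(atoms)):
        yield from extend([start])


def pattern_bits(pattern, n_bits, bits_per_pattern):
    """Map a pattern to at most bits_per_pattern positions with a stable hash."""
    positions = set()
    for k in range(bits_per_pattern):
        digest = hashlib.sha256(f"{pattern}/{k}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % n_bits)
    return positions


def hashed_fingerprint(atoms, bonds, max_steps=3, n_bits=10, bits_per_pattern=2):
    """Build a hashed binary fingerprint by OR-ing the bits of every pattern."""
    fp = [0] * n_bits
    for pattern in walks(atoms, bonds, max_steps):
        for pos in pattern_bits(pattern, n_bits, bits_per_pattern):
            fp[pos] = 1  # logical OR: a bit that is set stays set
    return fp


# Simplified ethanol chain C-C-O-H (other hydrogens omitted for brevity).
atoms = ["C", "C", "O", "H"]
bonds = [(0, 1), (1, 2), (2, 3)]
fp = hashed_fingerprint(atoms, bonds)
```

Because bits are only ever switched on, the fingerprint of a substructure can never contain a bit that is missing from the fingerprint of the full structure, which is what makes this kind of fingerprint usable as a substructure-search prefilter.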

Slide 8 Molecular descriptors Example 1: chemical fingerprint
Two β2-adrenoceptor agonist molecules and their 64-bit hashed chemical fingerprints. The fingerprints clearly reflect the high structural similarity between the two compounds: only two bits differ.

Slide 9 Molecular descriptors Example 2: pharmacophore fingerprint
encodes pharmacophore properties of molecules as frequency counts of pharmacophore point pairs at a given topological distance
allows the comparison of two molecules with respect to their pharmacophore
Construction
map pharmacophore point types to atoms
calculate the length of the shortest path between each pair of atoms
assign a histogram to every pharmacophore point pair and count the frequency of the pair with respect to its distance
Bits in the hashed binary fingerprint cannot be interpreted: there is no way to infer properties of the original structure from its fingerprint. Hashing is not the only way to construct fingerprints. An example is the pharmacophore fingerprint, in which each pharmacophore point pair is associated with a histogram. Such a fingerprint is not binary, yet it allows fast comparison of the pharmacophores of chemical structures. Its construction is fairly straightforward. First, each atom is labeled with its pharmacophore type (e.g. hydrogen bond donor, hydrophobic, anionic). Then the shortest path between each pair of points is calculated. Finally, a histogram is assigned to each pair of pharmacophore types (e.g. acceptor-acceptor, acceptor-donor, acceptor-hydrophobic). Each histogram has the same predefined number of bins, one for each topological distance considered (e.g. 1, 2, and so on up to 10). A bin stores the number of point pairs of the associated types that lie at the given topological distance.
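The three construction steps can be illustrated with a short sketch. It assumes that pharmacophore types have already been perceived and that each atom carries at most one type label; the function names and the single-letter labels ('A' acceptor, 'D' donor, 'H' hydrophobic) are choices made for this example only.

```python
from collections import deque
from itertools import combinations


def shortest_paths(n_atoms, bonds):
    """Topological distances (bond counts) from every atom, found by BFS."""
    neighbours = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    all_dist = []
    for start in range(n_atoms):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbours[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        all_dist.append(dist)
    return all_dist


def pharmacophore_fingerprint(types, bonds, max_dist=6):
    """Histogram of type pairs per topological distance 1..max_dist.

    types[i] is the (assumed, pre-assigned) type label of atom i, or None.
    """
    dist = shortest_paths(len(types), bonds)
    hist = {}
    for i, j in combinations(range(len(types)), 2):
        if types[i] is None or types[j] is None:
            continue
        d = dist[i].get(j)
        if d is None or d > max_dist:
            continue  # disconnected or farther apart than the last bin
        pair = "".join(sorted((types[i], types[j])))
        hist.setdefault(pair, [0] * max_dist)[d - 1] += 1
    return hist


# Hypothetical four-atom chain: donor, hydrophobic, untyped, acceptor.
example = pharmacophore_fingerprint(["D", "H", None, "A"],
                                    [(0, 1), (1, 2), (2, 3)])
```

For comparison with a distance function, the histograms of all type pairs would be concatenated in a fixed order into one numeric vector, with all-zero histograms for pairs that do not occur.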

Slide 10 Molecular descriptors Example 2: pharmacophore fingerprint
Pharmacophore point type based coloring of atoms: acceptor, donor, hydrophobic, none.
The pharmacophores of these two structures are the same (both are ACE inhibitors). A topological cross-correlation pharmacophore fingerprint is constructed from the structures and the mapped pharmacophore point types. (In this example only three point types were considered: acceptor, donor and hydrophobic.) Each point type pair is counted in the corresponding histogram bin according to its topological distance. (Topological distances from 1 to 6 bonds were considered.) The two histograms visibly represent the pharmacophoric similarity of the two compounds; in particular, the histograms related to the specific (or hard) pharmacophore points, hydrogen bond acceptor and donor (AA, DA), show significant similarity. Bars in the histograms of the second structure are higher because its molecular graph is larger.

Virtual screening using fingerprints
Slide 11 Virtual screening using fingerprints Individual query structure
Figure labels: query; query fingerprint; targets; target fingerprints; proximity; hits.
Equipped with fingerprints and dissimilarity metrics, virtual screening is easy. Both the query structure and the structures in the target library are transformed into fingerprints. Each target fingerprint is compared with the query fingerprint using a dissimilarity metric. If the dissimilarity value obtained is below a predefined threshold, the corresponding structure is a database hit (or virtual hit).
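A minimal sketch of this single-query screening loop, assuming fingerprints are plain numeric lists and targets are held in a dictionary keyed by a compound name. The function names, the example data and the threshold value are all illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def screen(query_fp, target_fps, dissimilarity, threshold):
    """Return (name, score) for every target scoring below the threshold."""
    hits = []
    for name, fp in target_fps.items():
        score = dissimilarity(query_fp, fp)
        if score < threshold:
            hits.append((name, score))
    # Most similar (lowest dissimilarity) first.
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical three-compound target library.
targets = {"t1": [1, 0, 2], "t2": [5, 5, 5], "t3": [1, 1, 2]}
hits = screen([1, 0, 2], targets, euclidean, threshold=1.5)
```

Passing the metric in as a function keeps the loop unchanged when a different dissimilarity measure is substituted.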

Virtual screening using fingerprints
Slide 12 Virtual screening using fingerprints Multiple query structures
Figure labels: queries; hypothesis fingerprint; targets; target fingerprints; proximity; hits.
The real-life scenario is somewhat different: there is more than one query structure, and the target database is searched for structures that are similar, to some extent, to all or any of these queries. To tackle this problem, the fingerprints of the individual active structures are replaced by one fingerprint that represents the features common to every structure. Since the fingerprint of an individual compound represents the properties of that compound, this combined fingerprint can be regarded as representing the properties of a hypothetical compound. In the case of pharmacophore fingerprints, the hypothesis fingerprint can be considered a pharmacophore hypothesis, that is, a simple model of the binding site of the receptor of the known actives.

Hypothesis fingerprints
Slide 13 Hypothesis fingerprints
Advantages
allows faster operation
compiles features common to every individual active
Hypothesis types (example table, only partly recoverable from the transcript): rows Active 1, Active 2, Active 3, Minimum, Average, Median; surviving cell values include Average 3.67 and 1.33, Median 1.5 and 5.5.
The use of a hypothesis fingerprint offers some advantages over individual fingerprints. Several fingerprints are replaced by a single one, so computations such as similarity searching take less time, and larger libraries become feasible to explore. Features common to all fingerprints are scaled up relative to less common features, and the influence of outliers is decreased. Various hypothesis types can be introduced; these behave in slightly different ways.

Hypothesis fingerprints
Slide 14 Hypothesis fingerprints Advantages Disadvantages
Minimum. Advantage: strict conditions for hits if actives are fairly similar. Disadvantages: false results with asymmetric metrics; misses common features of highly diverse sets; very sensitive to one missing feature.
Average. Advantage: captures common features of more diverse active sets. Disadvantage: less selective if actives are very similar.
Median. Specific treatment of the absence of a feature; less sensitive to outliers.
Every hypothesis type has advantages and disadvantages, some of which depend on the characteristics of the active set. The minimum, or consensus, hypothesis is the most restrictive. It is most beneficial when the actives exhibit high similarity. On the other hand, it is too sensitive to outliers: one missing feature in a single active cancels that feature in the hypothesis even if it occurs frequently in all other compounds. Average and median are less conservative, making it possible to construct a reasonable hypothesis for more diverse active sets.
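The three hypothesis types amount to a cell-wise minimum, mean or median over the fingerprints of the actives. The sketch below assumes equal-length numeric fingerprints; the function name and the example values are made up for illustration and are not the numbers from the slides.

```python
from statistics import mean, median


def hypothesis(fingerprints, kind="median"):
    """Combine equal-length fingerprints cell by cell into one hypothesis."""
    combine = {"minimum": min, "average": mean, "median": median}[kind]
    return [combine(cells) for cells in zip(*fingerprints)]


# Three hypothetical actives; the first one lacks the third feature.
actives = [
    [4, 2, 0],
    [5, 2, 3],
    [6, 8, 3],
]
minimum_h = hypothesis(actives, "minimum")
average_h = hypothesis(actives, "average")
median_h = hypothesis(actives, "median")
```

In this example the minimum hypothesis drops the third feature entirely because a single active lacks it, while the median keeps it; conversely the average is pulled up by the one large value in the second cell, which the median ignores.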

Does this work?
Slide 15 Does this work?
Enrichment ratios per active set (a dash marks a value missing from the transcript):
Active set     Size   Pharmacophore fingerprint   Chemical fingerprint
                      Tanimoto    Euclidean       Tanimoto    Euclidean
5-HT3            12     20.14       12.55          776.19      461.44
ACE              89      1.99        1.42            3.71        1.74
Angiotensin2     10     22.80       27.81          183.45      173.91
Beta2            50      3.59        1.52            7.52        2.65
D2               13     61.25       27.64          302.52      155.61
delta            20    109.53       11.66          114.48       56.22
Ftp              35     50.92       46.88          571.50      575.16
mGluR1           18     70.47        5.59          347.72      130.14
NPY-5           139      1.09        1.00            1.46        1.44
Thrombin          8      2.46        2.56            -           -
(For Thrombin a single chemical fingerprint value, 1.67, survives; its column cannot be determined.)
In our validation studies 10 different sets of structures with known pharmacological activity were used. Active sets were represented by a median hypothesis. Both the Tanimoto and the Euclidean metric achieved high enrichment ratios in some cases. An enrichment ratio of 10 means that with rational selection (i.e. virtual screening) it is 10 times more likely to pick structures that show activity than with random selection. A value around 20 is good; over 100 is very good. We can conclude that virtual screening works efficiently in most cases.
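The notes describe the enrichment ratio only verbally. One common way to compute it, assumed here rather than taken from the slides, is to divide the fraction of actives among the virtual hits by the fraction of actives in the whole screened set. The function and argument names are illustrative.

```python
def enrichment(active_hits, total_hits, actives_in_library, library_size):
    """Hit rate of the screen divided by the hit rate of random picking."""
    if total_hits == 0:
        raise ValueError("enrichment is undefined when there are no hits")
    hit_rate = active_hits / total_hits
    base_rate = actives_in_library / library_size
    return hit_rate / base_rate


# Hypothetical screen: 10 actives hidden among 10,000 structures;
# 50 virtual hits are returned, 5 of which are actives.
ratio = enrichment(active_hits=5, total_hits=50,
                   actives_in_library=10, library_size=10_000)
```

Here one hit in ten is active, against one in a thousand for random selection, so the screen is roughly 100 times better than chance.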

Then why do we need optimization?
Slide 16 Then why do we need optimization? Too many hits
Virtual screening can efficiently reduce the target library to a focused library that is enriched in compounds similar to the query structures. Yet this focused library can often still be too big: out of a few million targets, a few hundred thousand remain. A primary aim is therefore to further improve the enrichment ratio of the virtual screening procedure.

Then why do we need optimization?
Slide 17 Then why do we need optimization? Inconsistent dissimilarity values
Dissimilarity values shown in the figure: 0.47, 0.55, 0.57.
Structures that are rich in features may distort dissimilarity values. The two molecules on top are both potent ACE inhibitors and share a common pharmacophore. Their structural diagrams are colored according to the pharmacophore types of the atoms: red denotes hydrogen bond acceptors, green indicates hydrophobic atoms. A third structure, taken from the target library (and displayed using atom type coloring), shows higher similarity to either ACE inhibitor than the two ACE inhibitors show to each other. We expect dissimilarity values to reflect the similarities and dissimilarities actually observed. In the case of pharmacophore fingerprints these values should correlate with the known activity, and the metric should be biased against situations such as this one.

What can be optimized?
Slide 18 What can be optimized? Parameterized metrics
Formula labels on the slide: asymmetry factor; scaling factor; weights.
One possible approach to improving the efficiency of virtual screening is to introduce free parameters into the similarity/dissimilarity metrics (proximities). By tuning the values of these parameters, the behavior of the proximities can be adjusted so that similar structures score better than dissimilar ones. Asymmetric (or directed) metrics are biased towards the second argument, typically a hypothesis fingerprint. The idea is to penalize structures that lack features required by the hypothesis, and to ignore extra features in the structure that the hypothesis does not require. Scaling and weighting are similar in nature: both can reduce or enlarge the influence of individual fingerprint cells in the dissimilarity calculation.
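The formulas on the slide are not preserved in the transcript, so the following is only one plausible way to parameterize a Euclidean metric along the lines described. The functional form, the function name and the meaning of `alpha` and `weights` are assumptions made for this sketch, not the authors' definitions.

```python
import math


def weighted_asymmetric_euclidean(target, hypothesis, alpha=0.5, weights=None):
    """Dissimilarity of a target fingerprint to a hypothesis fingerprint.

    alpha   -- assumed asymmetry factor in [0, 1]: weight of features the
               hypothesis requires but the target lacks; (1 - alpha) is the
               weight of extra features present only in the target.
    weights -- assumed per-cell weights; defaults to 1.0 for every cell.
    """
    if len(target) != len(hypothesis):
        raise ValueError("fingerprints must have the same length")
    if weights is None:
        weights = [1.0] * len(hypothesis)
    total = 0.0
    for t, h, w in zip(target, hypothesis, weights):
        missing = max(h - t, 0)  # required by the hypothesis, absent in target
        extra = max(t - h, 0)    # present in target, not required
        total += w * (alpha * missing ** 2 + (1.0 - alpha) * extra ** 2)
    return math.sqrt(total)
```

With `alpha=1.0` extra features in the target are ignored completely, and only missing required features are penalized. With `alpha=0.5` and unit weights the value is the ordinary Euclidean distance multiplied by a constant factor (the square root of 0.5), so the ranking of targets is unchanged.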

Optimization of metrics
Slide 19 Optimization of metrics
Step 1: optimize parameters for maximum enrichment
Step 2: validate metrics over an independent test set
Figure labels: selected targets (training set, test set); known actives (training set, test set, query set).
Parameterized metrics are optimized with respect to a specific active set. Different actives (receptors) have different distinctive features, so optimized metrics are not interchangeable between active sets. The optimization of parameters is carried out in two main steps. First, the parameters are tuned to achieve the maximal enrichment ratio in screening. Then the metrics are validated in an independent blind test. For the sake of validation, both the target set and the set of known actives are subdivided into smaller subsets. Some randomly selected structures from the target library, typically a few hundred, are picked as a training set. A larger set, typically a few thousand structures, is kept for validation. The active set is divided into three parts, typically of the same or similar size. The training and test (or validation, or spike) subsets play the same role as in the case of the target set, while the third portion is used as the query structures.
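The partitioning described above can be sketched as a single helper. The function name, the dictionary keys and the fixed random seed are choices made for this illustration; the notes only specify the roles of the subsets and their approximate sizes.

```python
import random


def split_sets(targets, actives, n_target_train, seed=0):
    """Split targets into training/test and actives into three roles."""
    rng = random.Random(seed)
    targets = list(targets)
    actives = list(actives)
    rng.shuffle(targets)
    rng.shuffle(actives)
    third = len(actives) // 3
    return {
        "target_train": targets[:n_target_train],
        "target_test": targets[n_target_train:],
        "active_train": actives[:third],
        "active_test": actives[third:2 * third],  # the "spikes"
        "query": actives[2 * third:],
    }


# Hypothetical identifiers: 1000 library structures and 18 known actives.
parts = split_sets(range(1000), [f"active{i}" for i in range(18)],
                   n_target_train=200)
```

When the number of actives is not divisible by three, this sketch gives the remainder to the query set; that tie-breaking rule is arbitrary.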

Slide 20 Optimization of metrics Step 1: optimize parameters for maximum enrichment
Figure labels: query set; query fingerprint; training set; target hits; active hits.
In the optimization step the two training sets and the query set are used. The individual query structures are represented by a hypothesis fingerprint. The two training sets are mixed together to form one set, which is searched for virtual hits using the query hypothesis fingerprint. The goal of the optimization is to find as many elements of the active training set as possible while keeping the number of virtual hits from the target training set to a minimum. Optimal performance retrieves all actives and no target training structures.

Slide 21 Optimization of metrics One step of the algorithm
Figure legend: variables v1, v2, v3, …, vi, …, vn; potential variable value; temporarily fixed value; running variable value; final value.
The tunable parameters introduced into the metrics are sampled equidistantly on a gradually refined scale, in a semi-systematic exhaustive search procedure. The simple optimization algorithm adjusts the parameter values independently, varying one parameter at a time. Finding a global optimum does not require guidance, so this simple algorithm works well and quickly. A more sophisticated algorithm is not expected to provide clear benefits. During metric optimization all possible combinations of parameterized metrics can be considered: scaled Tanimoto, asymmetric Tanimoto, normalized Euclidean, weighted Euclidean, asymmetric Euclidean, scaled asymmetric normalized Euclidean, etc.
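A search of this kind can be sketched as a coordinate-wise grid search whose grid is narrowed around the current best value after each pass. The details below (number of samples, number of rounds, halving the range each round, the function name) are assumptions for illustration; the slides do not specify them.

```python
def coordinate_search(score, lower, upper, samples=5, rounds=4):
    """Maximise score(params), varying one parameter at a time.

    Each parameter is sampled on an equidistant grid while the others stay
    fixed; after every round the grid is narrowed around the best value.
    """
    lo, hi = list(lower), list(upper)
    params = [(a + b) / 2 for a, b in zip(lo, hi)]
    best = score(params)
    for _ in range(rounds):
        for i in range(len(params)):
            step = (hi[i] - lo[i]) / (samples - 1)
            for k in range(samples):
                trial = list(params)
                trial[i] = lo[i] + k * step
                value = score(trial)
                if value > best:
                    best, params = value, trial
        # Refine: halve each range, centred on the current best value
        # and clipped to the original bounds.
        for i in range(len(params)):
            half_width = (hi[i] - lo[i]) / 4
            lo[i] = max(lower[i], params[i] - half_width)
            hi[i] = min(upper[i], params[i] + half_width)
    return params, best


def toy_score(p):
    """Stand-in for an enrichment calculation; peaks at (0.3, 0.7)."""
    return -((p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2)


found, value = coordinate_search(toy_score, [0.0, 0.0], [1.0, 1.0])
```

In a real run `score` would perform a screening of the mixed training set with the trial parameters and return the resulting enrichment; the quadratic `toy_score` is used here only so the sketch runs on its own.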

Slide 22 Optimization of metrics Step 2: validate metrics over an independent test set
Figure labels: query set; query fingerprint; test set; target hits; active hits.
When all selected parameterized metrics have been optimized, they are tested in an independent split-sample validation step. Actives not used in the training/optimization step are mixed into the as yet unused subset of target structures; these actives are called the spikes. In contrast to the known active structures, the structures in the target set are assumed to exhibit no activity. The larger the ratio between the number of spikes and the number of target structures retrieved, the better the optimized metric has performed, i.e. the higher the enrichment over random.

Results Similar structures get closer
Slide 23 Results Similar structures get closer
Dissimilarity values shown in the figure: 0.47, 0.55, 0.57 (before optimization); 0.20, 0.06, 0.28 (after optimization).
The optimization stage can dramatically change the behavior of the metrics: in our previous example, the distance (dissimilarity value) between the two known actives has been significantly reduced by the optimized metric relative to their distance from the non-active structure. This fulfils one goal of the active-set-dependent optimization of virtual screening.

Results Hit set size reduction
Slide 24 Results Hit set size reduction
Active set: 18 mGlu-R1 antagonists
Target set: randomly selected drug-like structures + 7 spikes
(A dash marks a value missing from the transcript; for the scaled Tanimoto row the column assignment of the two surviving values is inferred.)
Metric                                      Enrichment   Test hits   Random hits
Tanimoto, Basic                                 70.47       5.43        172.00
Tanimoto, Scaled                                 7.63       6.00          -
Tanimoto, Asymmetric                            99.36       5.29        106.00
Tanimoto, Scaled Asymmetric                     11.94       5.86        731.14
Euclidean, Basic                                 5.59        -            -
Euclidean, Normalized                           11.33       5.14        791.29
Euclidean, Asymmetric Normalized                18.58       4.71        368.71
Euclidean, Weighted Normalized                 296.30       4.14         27.57
Euclidean, Weighted Asymmetric Normalized      281.30       3.43         17.00
The various optimized parameterized metrics perform differently. In the experiment summarized in this table the weighted normalized Euclidean metric scored best, with an enrichment ratio of about 296 over random, retrieving an average of 4.14 spikes out of 7. This means that it is about 296 times more likely to find an active molecule with the rational selection of structures from the database than with a random selection. Practically speaking, a compound library consisting of 1 million structures can be reduced to a few thousand structures by virtual screening using this optimized metric. This focused library is enriched in potential actives, i.e. structures whose pharmacophore is similar to that of the known actives. The figures shown in the table are averages of 7 independent tests using the median hypothesis. The acceptance threshold was set in the optimization procedure so as to retrieve 80% of the actives in the training set.
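Setting the acceptance threshold so that a given fraction of the training actives is retrieved can be sketched as picking a quantile of their dissimilarity scores. The function name and the convention that a hit scores at or below the threshold are assumptions of this example.

```python
import math


def acceptance_threshold(active_scores, recall=0.8):
    """Smallest threshold at or below which `recall` of the actives fall."""
    if not active_scores:
        raise ValueError("at least one active score is required")
    ranked = sorted(active_scores)
    needed = max(1, math.ceil(recall * len(ranked)))
    return ranked[needed - 1]


# Hypothetical dissimilarities of five training actives to the hypothesis.
threshold = acceptance_threshold([0.10, 0.40, 0.20, 0.90, 0.30])
```

With five actives and a target of 80%, four must be retrieved, so the threshold lands on the fourth-lowest score and the outlier at 0.90 is left out.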

Results Improvement by optimization
Slide 25 Results Improvement by optimization
Active set     Size   Euclidean   Optimized   Improvement ratio
5-HT3            12     12.55      239.24        49.26
ACE              89      1.42        6.50         4.64
Angiotensin2     10     27.81       85.45        11.15
Beta2            50      1.52       24.70        17.42
D2               13     27.64      123.25        11.19
delta            20     11.66      243.57        69.11
Ftp              35     46.88       71.54         5.35
mGluR1           18      5.59      296.30        70.93
NPY-5           139      1.00        3.22         3.25
Thrombin          8      2.56        4.57         2.62
Optimization can improve the efficiency of virtual screening in many cases. This table summarizes the results of a 7-fold cross-validation. The descriptor was a pharmacophore fingerprint, and all variants of the Euclidean metric were optimized. The values in the Euclidean and Optimized columns (yellow and orange on the slide) are average enrichment ratios. The improvement is the ratio of the enrichment values after and before optimization. Note that this improvement factor is not the ratio of the average enrichments but the average of the ratios of the actual enrichment values.
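The distinction in the last sentence explains why the improvement column cannot be reproduced by dividing the Optimized column by the Euclidean column. A small numeric illustration with made-up per-fold enrichment values (not data from the study):

```python
from statistics import mean

# Hypothetical enrichment values from three cross-validation folds.
before = [2.0, 10.0, 4.0]
after = [20.0, 20.0, 40.0]

# Average of the per-fold ratios (what the improvement column reports).
mean_of_ratios = mean(a / b for a, b in zip(after, before))

# Ratio of the two column averages (what dividing the columns would give).
ratio_of_means = mean(after) / mean(before)
```

The per-fold ratios are 10, 2 and 10, giving an average of about 7.33, whereas the ratio of the averages is 5. Folds with a low enrichment before optimization weigh heavily in the average of ratios.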

Results Active Hit Distribution
Slide 26 Results Active Hit Distribution
offers a more intuitive way to evaluate the efficiency of screening
is based on sorting random set hits and known actives by dissimilarity value and counting the number of random set hits preceding each active in the sorted list
Figure: axes are number of virtual hits and number of actives; example dissimilarity values 0.014, 0.015, 0.017, 0.020, 0.022, 0.023, 0.027, 0.041, 0.043.
Enrichment, though widely used, is not the most suitable characterization of the efficiency of virtual screening:
Its value is not bounded; there is no 'best' enrichment.
Its maximal possible value depends on the number of known actives.
It is too sensitive to non-active hits: finding 2 actives and 0 inactives may result in a much larger enrichment ratio than finding, say, all six actives and 3 inactives.
Its actual value strongly depends on the threshold value used in virtual screening.
The distribution of actives in the series of virtual hits gives rise to an intuitive visualization of the efficiency of virtual screening, as well as to an alternative measure to enrichment. Virtual hits are sorted by the dissimilarity values calculated during the screening process. In the ideal case all actives are the first to emerge from this sorting. The more the actives are spread out within the sorted series, the worse the screening procedure performs. The light blue dashed line indicates the theoretical optimum, in which the first n hits are all from the set of known actives. The closer the sample points approach this line, the better the screening has performed.
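The counting behind an Active Hit Distribution can be sketched directly from the description above. The input format, a list of (dissimilarity, is_active) pairs, and the function name are assumptions of this example, and the scores are invented.

```python
def active_hit_distribution(scored):
    """For each active, in rank order, count the non-actives ranked before it.

    scored -- iterable of (dissimilarity, is_active) pairs.
    """
    preceding = 0
    counts = []
    for _, is_active in sorted(scored, key=lambda item: item[0]):
        if is_active:
            counts.append(preceding)
        else:
            preceding += 1
    return counts


# Hypothetical screening result: four actives and two random-set hits.
result = active_hit_distribution([
    (0.014, True),
    (0.015, True),
    (0.017, False),
    (0.020, True),
    (0.041, False),
    (0.043, True),
])
```

An ideal screen yields only zeros, meaning every active is ranked ahead of every non-active; plotting the index of each active against its count gives the curve described in the notes.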

Results ACE (pharmacophore similarity) Slide 27
An Active Hit Distribution plot for the ACE inhibitor set. The diagram clearly shows that optimization improved the performance of the Euclidean metric by at least an order of magnitude.

Results NPY-5 (pharmacophore similarity) Slide 28
There are cases when even basic metrics perform optimally. From our validation studies, in the case of the NPY-5 active set the first seven hits were actives in all cases. However, to retrieve more than 10 actives required a larger virtual hit set when basic metrics were used. Optimization led to 1 to 2 orders of magnitude improvement.

Results β2-adrenoceptor (pharmacophore similarity) Slide 29
This example demonstrates that optimization can achieve optimal behavior even when basic metrics perform poorly. This plot illustrates best the advantage of optimization in virtual screening.

Results Structural or pharmacophore fingerprint?
Slide 30 Results Structural or pharmacophore fingerprint?
(A dash marks a value missing from the transcript.)
Active set     Size   Chemical   Pharmacophore   Diversity*
5-HT3            12    692.21       239.24          0.30
ACE              89      4.29         6.50          0.56
Angiotensin2     10    190.76        85.45          0.40
Beta2            50     10.98        24.70          0.50
D2               13    358.10       123.25           -
delta            20    249.40       243.57          0.32
Ftp              35    575.16        71.54           -
mGluR1           18    350.86       296.30          0.37
NPY-5           139      1.52         3.22          0.47
Thrombin          8      3.59         4.57          0.46
* Average of 1 - Tanimoto coefficient over all pairs of compounds in the active set, based on the chemical fingerprint.
Generating the structural fingerprint is much faster than creating the pharmacophore fingerprint (in which pharmacophore perception takes a long time), and the storage requirement of the structural fingerprint is also significantly smaller. The question therefore arises naturally: is it worth using the more resource-intensive pharmacophore fingerprint? What benefits can it bring, and under what circumstances? In this table the best mean enrichment ratios achieved are listed for both the topological and the pharmacophore fingerprint. The 'diversity' column is an estimate of the topological diversity of the corresponding active set. The table indicates that the costly pharmacophore fingerprint is beneficial when the diversity of the active set is over 0.45. (Please note that these are preliminary results; a more thorough statistical analysis will be carried out later.)
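The diversity measure defined in the footnote can be computed as follows. This sketch assumes binary chemical fingerprints stored as lists of 0/1 values; the function names and the example fingerprints are illustrative.

```python
from itertools import combinations


def tanimoto(a, b):
    """Binary Tanimoto coefficient: common bits / bits set in either vector."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def diversity(fingerprints):
    """Average of 1 - Tanimoto over all pairs of fingerprints in the set."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are required")
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical 4-bit fingerprints; the first two are identical.
value = diversity([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]])
```

A value of 0 means all actives have identical fingerprints; values approaching 1 indicate that the actives share almost no structural features.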

Results Scaffold hopping Slide 31
Scaffold hopping is particularly important when screening a database for novel drug candidates. In this example 30 ACE inhibitors were mixed into a library of drug-like molecules. Another ACE inhibitor, Captopril (left), was used as the query. The top 10 hits retrieved by an optimized scaled Tanimoto metric were all spikes (right). Four of these have a different scaffold (molecular backbone) from the query structure, while the other six are structural analogs. A pharmacophore fingerprint was used in this experiment. Because of its nature, a chemical fingerprint can hardly achieve scaffold hopping.

Acknowledgements
Slide 32 Acknowledgements Contributors: Nóra Máté, Szilárd Dóránt, Bernard Przybylski (Axovan). The research was supported by (Axovan is now part of Actelion.)

Bibliography
J. Xu: GMA: A Generic Match Algorithm for Structural Homomorphism, Isomorphism, and Maximal Common Substructure Match and its Applications, J. Chem. Inf. Comput. Sci., 1996, 36, 1
L. Xue, F. L. Stahura, J. W. Godden, J. Bajorath: Fingerprint Scaling Increases the Probability of Identifying Molecules with Similar Activity in Virtual Screening Calculations, J. Chem. Inf. Comput. Sci., 2001, 41, 3
G. Schneider, W. Neidhart, T. Giller, G. Schmid: 'Scaffold-Hopping' by Topological Pharmacophore Search: A Contribution to Virtual Screening, Angew. Chem. Int. Ed., 1999, 38, 19
D. Horvath: High Throughput Conformational Sampling and Fuzzy Similarity Metrics: A Novel Approach to Similarity Searching and Focused Combinatorial Library Design and its Role in the Drug Discovery Laboratory; manuscript
J. Bajorath: Virtual screening in drug discovery: Methods, expectations and reality
