Presentation on theme: "Optimized Virtual Screening"— Presentation transcript:
1 Optimized Virtual Screening
Miklós Vargyas, Zsuzsanna Szabó, György Pirok, Ferenc Csizmadia, Matthias Steger, Modest von Korff
ChemAxon Ltd.; AXOVAN AG, Allschwil, Switzerland
(Axovan is now Actelion.)
2 Drug research: Is it searching for a needle in a haystack?
Figure labels: corporate database; structures found.
Drug research is often described as searching for a needle in a haystack. Large compound libraries, corporate databases, virtual compound collections and suppliers' databases are routinely screened for drug candidates. The situation is actually worse than finding a needle in a haystack, because we do not know the exact shape of the needle. We have very limited knowledge about the thing we need to find: often we do not know what it looks like, or even whether it is there. There is also a distinct possibility that we end up finding more than one.
3 Drug research: Find something similar to a fistful of needles
Figure labels: query structures (known actives); corporate database (targets); structures found (virtual hits).
In virtual screening the target database is searched for structures that bear some similarity to a small set of query structures. Typically, the query molecules exhibit similar biological or pharmacological activity; for instance, they bind to the same protein to form a receptor-ligand complex. The expectation is that database structures found to be similar to the query structures are functional analogs and thus will have similar activity.
(Note that this talk focuses on ligand-based drug research, where no direct information on the structure of the target protein is available. Techniques like 3D docking lie outside the scope of this work.)
4 Molecular similarity: What is it?
- Chemical, pharmacological or biological properties of two compounds match.
- The more features two molecules have in common, the higher their similarity.
Figure labels: Chemical (top pair of structures); Pharmacophore (bottom pair of structures).
In the simplest case the question of molecular similarity is raised for two molecules. They are regarded as similar if their chemical/topological, pharmacological or biological properties match.
The two structures at the top are chemically similar to each other. This is reflected in their common sub-graph, or scaffold: they share 14 atoms.
The two structures at the bottom are less similar chemically (topologically), yet they have the same pharmacological activity: both are Angiotensin-Converting Enzyme (ACE) inhibitors.
5 Molecular similarity: How to calculate it?
- Quantitative assessment of the similarity/dissimilarity of structures needs a numerically tractable form: molecular descriptors, fingerprints, structural keys.
- These are sequences/vectors of bits or numeric values that can be compared by distance functions and similarity metrics.
Assessing the similarity or dissimilarity of two compounds is typically not done at the structural level (though it is possible, for instance by the size of the maximum common sub-graph). Instead, structures are encoded into (or represented by) a set of values that is numerically tractable. A wide variety of such encodings is available: molecular descriptors, molecular fingerprints and structural keys are all well-known approaches. Each can be represented by a multidimensional vector, and the dissimilarity of two structures can be expressed as the distance between their vectors. The Euclidean distance is the most widely used dissimilarity function of this type. Another family of proximity measures regards the set of values as a series and calculates the ratio between matching and differing values at the same positions in the two series. A notable example is the Tanimoto coefficient.
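The two proximity measures named above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the usual textbook definition of the binary Tanimoto coefficient (bits set in both fingerprints divided by bits set in either), and the function names and example fingerprints are hypothetical.

```python
import math

def tanimoto_similarity(a, b):
    """Tanimoto coefficient of two equal-length bit sequences (values 0/1)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    return 1.0 if either == 0 else both / either

def euclidean_distance(a, b):
    """Euclidean distance of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 8-bit fingerprints that differ in two positions.
fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(tanimoto_similarity(fp1, fp2))   # similarity: higher means more alike
print(euclidean_distance(fp1, fp2))    # dissimilarity: lower means more alike
```

Note the opposite orientation of the two values: Tanimoto is a similarity, so `1 - tanimoto_similarity(a, b)` is the corresponding dissimilarity.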
6 Molecular descriptors. Example 1: chemical fingerprint
- hashed binary fingerprint
- encodes topological properties of the chemical graph: connectivity, edge label (bond type), node label (atom type)
- allows the comparison of two molecules with respect to their chemical structure
Construction:
- find all 0, 1, …, n step walks in the chemical graph
- generate a bit array for each walk, with a given number of bits set
- merge the bit arrays with a logical OR operation
Topological chemical fingerprints encode structural properties of the chemical graph as a sequence of bits. The encoding (hashing) is not reversible, so two different structures can have the same fingerprint (which makes the name "fingerprint" somewhat misleading). If a certain feature is present in the structure, for instance a C-O-H pattern, then specific bits in the series are always set; however, the very same bits can also be set by many other structural features. These properties make the hashed fingerprint suitable for structural comparison, substructure search and similarity search.
Various encoding schemes are available. One takes all walks up to a given maximum length in the chemical graph as the patterns to be encoded. For each such structural pattern certain bits are turned on in the fingerprint. Once a bit is turned on by a feature it remains 1; other features cannot cancel it.
7 Molecular descriptors. Example 1: chemical fingerprint (example)
Structure: CH3 – CH2 – OH
Walks from the first carbon atom (the bit-array column of the slide is not preserved in the transcript):
length   walk
0        C
1        C – H, C – C
2        C – C – H, C – C – O
3        C – C – O – H
This example illustrates how a 10-bit topological chemical fingerprint is created for a simple chain structure. All walks of up to 3 steps are considered, and 2 bits are set for each pattern. The bit arrays for the first carbon atom are then merged (the merged array is shown only on the slide).
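The construction steps above can be sketched in a few lines. This is only an illustration under stated assumptions: walks are enumerated as simple paths (no atom visited twice), bond types are ignored because ethanol has single bonds only, and SHA-256 is an arbitrary stand-in for the hash function, which the slides do not specify. All function names are hypothetical.

```python
import hashlib

def enumerate_patterns(atoms, neighbors, max_steps):
    """Collect label strings of all simple paths with 0..max_steps bonds."""
    patterns = set()

    def extend(path):
        labels = [atoms[i] for i in path]
        # A path and its reverse describe the same feature; keep one spelling.
        patterns.add("-".join(min(labels, labels[::-1])))
        if len(path) - 1 < max_steps:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return patterns

def pattern_bits(pattern, n_bits, bits_per_pattern):
    """Map a pattern to bit positions (assumed hash: SHA-256, at most 8 positions)."""
    digest = hashlib.sha256(pattern.encode("utf-8")).digest()
    return {int.from_bytes(digest[4 * k:4 * k + 4], "big") % n_bits
            for k in range(bits_per_pattern)}

def hashed_fingerprint(atoms, neighbors, max_steps=3, n_bits=10, bits_per_pattern=2):
    fingerprint = [0] * n_bits
    for pattern in enumerate_patterns(atoms, neighbors, max_steps):
        for bit in pattern_bits(pattern, n_bits, bits_per_pattern):
            fingerprint[bit] = 1  # logical OR: a set bit is never cleared
    return fingerprint

# Ethanol, CH3-CH2-OH, with explicit hydrogens.
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]
neighbors = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)

patterns = enumerate_patterns(atoms, neighbors, max_steps=3)
fingerprint = hashed_fingerprint(atoms, neighbors)
print(sorted(patterns))
print(fingerprint)
```

With only 10 bits and many patterns most bits end up set, which shows why practical fingerprints are much longer than this toy example.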
8 Molecular descriptors. Example 1: chemical fingerprint
Two β2 adrenoceptor agonist molecules and their 64-bit hashed chemical fingerprints. The fingerprints clearly reflect the high structural similarity of the two compounds: only two bits differ.
9 Molecular descriptors. Example 2: pharmacophore fingerprint
- encodes pharmacophore properties of molecules as frequency counts of pharmacophore point pairs at given topological distances
- allows the comparison of two molecules with respect to their pharmacophore
Construction:
- map pharmacophore point types to atoms
- calculate the length of the shortest path between each pair of atoms
- assign a histogram to every pharmacophore point pair and count the frequency of the pair with respect to its distance
Bits in the hashed binary fingerprint cannot be interpreted: there is no way to infer properties of the original structure from its fingerprint. Fingerprints can also be constructed without hashing. One example is the pharmacophore fingerprint, in which each pharmacophore point pair is associated with a histogram. Such a fingerprint is not binary, yet it allows fast comparison of the pharmacophores of chemical structures.
The construction is fairly straightforward. First, each atom is labeled with its pharmacophore type (e.g. hydrogen bond donor, hydrophobic, anionic). Then the shortest paths between all point pairs are calculated. Finally, a histogram is assigned to each pair of pharmacophore types (e.g. acceptor-acceptor, acceptor-donor, acceptor-hydrophobic). Every histogram has the same predefined number of bins, one for each topological distance considered (e.g. 1, 2, and so on up to 10). Each bin stores the number of pairs of the associated types that lie at the given topological distance.
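A minimal sketch of this construction is given below. It assumes that every atom carries at most one pharmacophore type (real perception can assign several), uses the one-letter codes A (acceptor), D (donor) and H (hydrophobic), and measures topological distance by breadth-first search. The function names and the four-atom example are hypothetical.

```python
from collections import deque
from itertools import combinations_with_replacement

def shortest_path_lengths(neighbors, source):
    """Breadth-first search: distance in bonds from source to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pharmacophore_histograms(types, neighbors, point_types=("A", "D", "H"), max_distance=6):
    """One histogram per unordered type pair; bin d-1 counts pairs d bonds apart."""
    pair_keys = list(combinations_with_replacement(sorted(point_types), 2))
    histograms = {key: [0] * max_distance for key in pair_keys}
    typed_atoms = [i for i, t in enumerate(types) if t in point_types]
    for idx, i in enumerate(typed_atoms):
        dist = shortest_path_lengths(neighbors, i)
        for j in typed_atoms[idx + 1:]:
            d = dist.get(j)
            if d is not None and 1 <= d <= max_distance:
                key = tuple(sorted((types[i], types[j])))
                histograms[key][d - 1] += 1
    return histograms

def flatten(histograms):
    """Concatenate the histograms in a fixed key order to get the fingerprint vector."""
    return [count for key in sorted(histograms) for count in histograms[key]]

# Hypothetical chain of four atoms: donor - untyped - acceptor - hydrophobic.
types = ["D", None, "A", "H"]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
histograms = pharmacophore_histograms(types, neighbors)
fingerprint = flatten(histograms)
print(histograms[("A", "D")])
print(len(fingerprint))
```

Because the bins hold counts rather than bits, larger molecules naturally produce taller histograms, which is the effect noted for the second ACE inhibitor on the next slide.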
10 Molecular descriptors. Example 2: pharmacophore fingerprint
Figure: atoms colored by pharmacophore point type (acceptor, donor, hydrophobic, none).
The pharmacophores of these two structures are the same (both are ACE inhibitors). A topological cross-correlation pharmacophore fingerprint is constructed from the structures and the mapped pharmacophore point types. (In this example only three point types were considered: acceptor, donor and hydrophobic.) Each point type pair is counted in the histogram bin that corresponds to its topological distance. (Topological distances from 1 to 6 bonds were considered.) The two sets of histograms visibly reflect the pharmacophoric similarity of the two compounds. The histograms of the specific (or hard) pharmacophore points, hydrogen bond acceptor and donor (AA, DA), are especially similar. Bars in the histograms of the second structure are higher because its molecular graph is larger.
11 Virtual screening using fingerprints: Individual query structure
Figure labels: query; query fingerprint; targets; target fingerprints; proximity; hits.
With fingerprints and dissimilarity metrics in hand, virtual screening becomes straightforward. Both the query structure and the structures in the target library are transformed into fingerprints. The query fingerprint is compared against each target fingerprint using a dissimilarity metric. If the dissimilarity value is below a predefined threshold, the corresponding structure is a database hit (or virtual hit).
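The screening loop described here amounts to a filter over the target fingerprints. The sketch below assumes binary fingerprints and uses 1 minus the Tanimoto coefficient as the dissimilarity; the threshold, target names and fingerprints are hypothetical.

```python
def tanimoto_dissimilarity(a, b):
    """1 - Tanimoto coefficient for two equal-length bit sequences."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either

def screen(query_fp, target_fps, dissimilarity, threshold):
    """Return (dissimilarity, name) pairs below the threshold, best hit first."""
    hits = []
    for name, fp in target_fps.items():
        d = dissimilarity(query_fp, fp)
        if d < threshold:
            hits.append((d, name))
    return sorted(hits)

query = [1, 1, 0, 1, 0, 0, 1, 0]
targets = {
    "t1": [1, 1, 0, 1, 0, 0, 1, 1],
    "t2": [0, 0, 1, 0, 1, 1, 0, 0],
    "t3": [1, 0, 0, 1, 0, 0, 1, 0],
}
hits = screen(query, targets, tanimoto_dissimilarity, threshold=0.3)
print([name for _, name in hits])
```

Any other dissimilarity function with the same two-argument signature can be passed in place of `tanimoto_dissimilarity`.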
12 Virtual screening using fingerprints: Multiple query structures
Figure labels: queries; hypothesis fingerprint; targets; target fingerprints; proximity; hits.
The real-life scenario is somewhat different: there is more than one query structure, and the target database is searched for structures that are similar, to some extent, to all or any of these queries. To tackle this problem, the fingerprints of the individual active structures are replaced by one fingerprint that represents the features common to all of them. Since the fingerprint of an individual compound represents the properties of that compound, this combined fingerprint can be regarded as representing the properties of a hypothetical compound. In the case of pharmacophore fingerprints, the hypothesis fingerprint can be considered a pharmacophore hypothesis, that is, a simple model of the binding site of the receptor of the known actives.
13 Hypothesis fingerprints
Advantages:
- allows faster operation
- compiles features common to all individual actives
Hypothesis types:
[Table: example fingerprints of Active 1, Active 2 and Active 3, with the resulting Minimum, Average and Median hypothesis rows. The cell values are not recoverable from the transcript.]
Using a hypothesis fingerprint offers some advantages over using individual fingerprints. Several fingerprints are replaced by a single one, so computations such as similarity searching take less time and larger libraries become feasible to explore. Features common to all fingerprints are scaled up relative to less common features. The influence of outliers is decreased.
Various hypothesis types can be introduced; they behave in slightly different ways.
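The three hypothesis types can be read as column-wise reductions over the fingerprints of the actives. The sketch below assumes exactly that reading; the count values are hypothetical and are not the ones from the slide's table.

```python
from statistics import mean, median

def hypothesis_fingerprint(active_fps, kind="median"):
    """Combine equal-length fingerprints position by position."""
    reducers = {"minimum": min, "average": mean, "median": median}
    reduce_column = reducers[kind]
    return [reduce_column(column) for column in zip(*active_fps)]

# Hypothetical count-based fingerprints of three actives.
actives = [
    [2, 7, 1, 0],
    [3, 5, 1, 4],
    [6, 0, 2, 5],
]
minimum_fp = hypothesis_fingerprint(actives, "minimum")
average_fp = hypothesis_fingerprint(actives, "average")
median_fp = hypothesis_fingerprint(actives, "median")
print(minimum_fp)
print(average_fp)
print(median_fp)
```

In the second and fourth columns a single zero removes the feature from the minimum hypothesis even though the other two actives have it, while the average and median hypotheses retain it.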
14 Hypothesis fingerprints: Advantages and disadvantages
Minimum
- advantage: strict conditions for hits if the actives are fairly similar
- disadvantages: false results with asymmetric metrics; misses common features of highly diverse sets; very sensitive to one missing feature
Average
- advantage: captures common features of more diverse active sets
- disadvantage: less selective if the actives are very similar
Median
- specific treatment of the absence of a feature
- less sensitive to outliers
Every hypothesis type has advantages and disadvantages, some of which depend on the characteristics of the active set. The minimum, or consensus, hypothesis is the most restrictive. It is most beneficial when the actives are highly similar. On the other hand it is very sensitive to outliers: a feature missing from just one of the actives is cancelled in the hypothesis even if it occurs frequently in all the other compounds. Average and median are less conservative and allow reasonable hypotheses to be constructed for more diverse active sets.
15 Does this work?
Enrichment ratios per active set:
                        Pharmacophore fingerprint    Chemical fingerprint
Active set       Size   Tanimoto    Euclidean        Tanimoto    Euclidean
5-HT3              12      20.14       12.55           776.19      461.44
ACE                89       1.99        1.42             3.71        1.74
Angiotensin2       10      22.80       27.81           183.45      173.91
Beta2              50       3.59        1.52             7.52        2.65
D2                 13      61.25       27.64           302.52      155.61
delta              20     109.53       11.66           114.48       56.22
Ftp                35      50.92       46.88           571.50      575.16
mGluR1             18      70.47        5.59           347.72      130.14
NPY-5             139       1.09        1.00             1.46        1.44
Thrombin            8       2.46        2.56             1.67 (only one chemical-fingerprint value survives; its column is uncertain)
In our validation studies 10 different sets of structures with known pharmacological activity were used. Both the Tanimoto and the Euclidean metric achieved high enrichment ratios in some cases. Active sets were represented by a median hypothesis.
An enrichment ratio of 10 means that with rational selection (i.e. virtual screening) it is 10 times more likely to pick structures that show activity than with random selection. A value around 20 is good; a value over 100 is very good.
So we can conclude that virtual screening works efficiently in most cases.
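The enrichment ratio explained above can be written down as a short function. The formula below is the commonly used definition (the hit rate among the selected structures divided by the hit rate expected from random selection); that it matches the authors' exact calculation is an assumption, and the numbers in the example are hypothetical.

```python
def enrichment_ratio(active_hits, total_hits, actives_in_library, library_size):
    """How many times more likely a selected structure is active than a random one."""
    if total_hits <= 0 or actives_in_library <= 0 or library_size <= 0:
        raise ValueError("counts must be positive")
    hit_rate = active_hits / total_hits               # fraction of actives among virtual hits
    random_rate = actives_in_library / library_size   # fraction of actives in the whole library
    return hit_rate / random_rate

# Hypothetical screen: 10 actives hidden in 10,000 structures,
# 50 virtual hits of which 5 are actives.
ratio = enrichment_ratio(active_hits=5, total_hits=50,
                         actives_in_library=10, library_size=10_000)
print(round(ratio, 2))
```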
16 Then why do we need optimization? Too many hits
Virtual screening can efficiently reduce the target library to a focused library that is enriched in compounds similar to the query structures. Yet this focused library is often still too big: out of a few million targets, a few hundred thousand remain. A primary aim is therefore to further improve the enrichment ratio of the virtual screening procedure.
17 Then why do we need optimization? Inconsistent dissimilarity values
Dissimilarity values shown in the figure: 0.47, 0.55, 0.57.
Structures that are rich in features may distort dissimilarity values. The two molecules at the top are both potent ACE inhibitors and share a common pharmacophore. Their structural diagrams are colored according to the pharmacophore types of the atoms: red denotes hydrogen bond acceptors, green indicates hydrophobic atoms. A third structure, taken from the target library (and displayed using atom type coloring), is more similar to each ACE inhibitor than the two ACE inhibitors are to each other. We expect dissimilarity values to reflect the similarities and dissimilarities actually observed. In the case of pharmacophore fingerprints these values should correlate with the known activity, and the metric should be biased against situations like this one.
18 What can be optimized? Parameterized metrics
- asymmetry factor
- scaling factor
- weights
One possible approach to improving the efficiency of virtual screening is to introduce free parameters into the similarity/dissimilarity metrics (proximities). By tuning these parameters, the behavior of the proximities can be adjusted so that similar structures score better than dissimilar ones.
Asymmetric (or directed) metrics are biased towards the second argument, typically a hypothesis fingerprint. The idea is to penalize structures that lack features required by the hypothesis, and to ignore extra features in the structure that the hypothesis does not require.
Scaling and weighting are similar in nature: both can decrease or increase the influence of individual fingerprint cells in the dissimilarity calculation.
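One way such a parameterized metric could look is sketched below. The slides do not give the formula, so this weighted asymmetric Euclidean distance is a hypothetical form chosen for illustration: the `asymmetry` parameter (between 0 and 1) weights features the structure lacks relative to the hypothesis, `1 - asymmetry` weights extra features, and `weights` scales individual fingerprint cells.

```python
import math

def weighted_asymmetric_euclidean(structure_fp, hypothesis_fp, weights=None, asymmetry=0.5):
    """Hypothetical parameterized distance of a structure from a hypothesis.

    asymmetry = 0.5 treats missing and extra features alike (a rescaled
    Euclidean distance); asymmetry = 1.0 penalizes only missing features.
    """
    if weights is None:
        weights = [1.0] * len(hypothesis_fp)
    total = 0.0
    for x, h, w in zip(structure_fp, hypothesis_fp, weights):
        diff = x - h
        # diff < 0: the structure has less of this feature than the hypothesis requires.
        factor = asymmetry if diff < 0 else 1.0 - asymmetry
        total += w * factor * diff * diff
    return math.sqrt(total)

hypothesis = [2, 1, 0]
feature_rich = [2, 1, 5]   # has everything required plus an extra feature
feature_poor = [0, 1, 0]   # lacks a required feature

print(weighted_asymmetric_euclidean(feature_rich, hypothesis, asymmetry=1.0))
print(weighted_asymmetric_euclidean(feature_poor, hypothesis, asymmetry=1.0))
```

With `asymmetry=1.0` the feature-rich structure is not penalized at all, which is exactly the behavior that can correct the inconsistency caused by feature-rich library compounds.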
19 Optimization of metrics
Step 1: optimize parameters for maximum enrichment
Step 2: validate metrics over an independent test set
Figure labels: selected targets (training set, test set); known actives (training set, test set, query set).
Parameterized metrics are optimized with respect to a specific active set. Different actives (receptors) have different distinctive features, so optimized metrics are not interchangeable between active sets. The optimization of parameters is carried out in two main steps. First, the parameters are tuned to achieve the maximal enrichment ratio in screening. Then the metrics are validated in an independent blind test. For the sake of validation, both the target set and the set of known actives are subdivided into smaller subsets. Some randomly selected structures from the target library, typically a few hundred, are picked as a training set. A larger set, typically a few thousand structures, is kept for validation. The active set is divided into three parts, typically of the same or similar size. Its training and test (or validation, or spike) subsets play the same roles as the corresponding subsets of the target set, while the third part is used as the query structures.
20 Optimization of metrics
Step 1: optimize parameters for maximum enrichment
Figure labels: training set; query set; query fingerprint; target hits; active hits.
The optimization step uses the two training sets and the query set. The individual query structures are represented by a hypothesis fingerprint. The two training sets are mixed together to form one set, which is searched for virtual hits using the query hypothesis fingerprint. The goal of the optimization is to find as many elements of the active training set as possible while keeping the number of virtual hits from the target training set to a minimum. The optimal metric retrieves all the actives and none of the target training structures.
21 Optimization of metrics: One step of the algorithm
Figure labels: variables v1, v2, v3, …, vi, …, vn; potential variable value; temporarily fixed value; running variable value; final value.
The tunable parameters introduced in the metrics are sampled equidistantly on a gradually refined scale, in a semi-systematic exhaustive search procedure. This simple optimization algorithm tunes the parameter values independently, varying one parameter at a time. Finding a global optimum does not require guidance, thus this simple algorithm works well and quickly. We consider that a more sophisticated algorithm would not provide clear benefits.
During metric optimization all possible combinations of parameterized metrics can be considered: scaled Tanimoto, asymmetric Tanimoto, normalized Euclidean, weighted Euclidean, asymmetric Euclidean, scaled asymmetric normalized Euclidean, etc.
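The search procedure described on this slide can be sketched as a one-parameter-at-a-time grid search with a shrinking window. The details below (number of samples, number of rounds, halving of the window) are assumptions made for illustration, and the quadratic `toy_score` merely stands in for the enrichment obtained by screening the training set.

```python
def optimize_parameters(score, initial, lower, upper, samples=11, rounds=6):
    """Maximize score(params) by varying one parameter at a time on an
    equidistant grid whose window is halved after every round."""
    params = list(initial)
    best = score(params)
    # Start with a window wide enough to cover the full allowed range.
    width = [2.0 * (hi - lo) for lo, hi in zip(lower, upper)]
    for _ in range(rounds):
        for i in range(len(params)):
            lo = max(lower[i], params[i] - width[i] / 2)
            hi = min(upper[i], params[i] + width[i] / 2)
            step = (hi - lo) / (samples - 1)
            for k in range(samples):
                candidate = params.copy()
                candidate[i] = lo + k * step   # running value; other parameters stay fixed
                value = score(candidate)
                if value > best:
                    best, params = value, candidate
        width = [w / 2 for w in width]         # refine the scale
    return params, best

def toy_score(p):
    """Stand-in objective with its maximum at (0.3, 0.7)."""
    return -((p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2)

best_params, best_score = optimize_parameters(
    toy_score, initial=[0.5, 0.5], lower=[0.0, 0.0], upper=[1.0, 1.0])
print(best_params, best_score)
```

In a real run `score` would build the metric from the candidate parameters, screen the mixed training set with it, and return the resulting enrichment.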
22 Optimization of metrics
Step 2: validate metrics over an independent test set
Figure labels: test set; query set; query fingerprint; target hits; active hits.
When all the selected parameterized metrics have been optimized, they are tested in an independent split-sample validation step. Actives not used in the training/optimization step are mixed into the as yet unused subset of target structures. These actives are called the spikes. In contrast to the known active structures, the structures in the target set are assumed to exhibit no activity.
The higher the ratio of the number of spikes retrieved to the number of target structures retrieved, the better the optimized metric has performed, i.e. the higher the enrichment over random.
23 Results: Similar structures get closer
Dissimilarity values shown in the figure: 0.47, 0.55, 0.57 (basic metric); 0.20, 0.06, 0.28 (optimized metric).
The optimization stage can dramatically change the behavior of the metrics. In our previous example, the distance (dissimilarity value) between the two known actives is significantly reduced by the optimized metric relative to their distances from the non-active structure.
This completes one goal of the active-set-dependent optimization of virtual screening.
24 Results: Hit set size reduction
Active set: 18 mGlu-R1 antagonists
Target set: randomly selected drug-like structures + 7 spikes
Metric                                       Enrichment   Test hits   Random hits
Tanimoto
  Basic                                          70.47        5.43       172.00
  Scaled                                          7.63        6.00      (missing)
  Asymmetric                                     99.36        5.29       106.00
  Scaled Asymmetric                              11.94        5.86       731.14
Euclidean
  Basic                                           5.59      (missing)   (missing)
  Normalized                                     11.33        5.14       791.29
  Asymmetric Normalized                          18.58        4.71       368.71
  Weighted Normalized                           296.30        4.14        27.57
  Weighted Asymmetric Normalized                281.30        3.43        17.00
The various optimized parameterized metrics perform differently. In the experiment summarized in this table the weighted normalized Euclidean metric scored best, with an enrichment ratio of 296.30 over random, retrieving an average of 4.14 spikes out of 7. This means that it is about 296 times more likely to find an active molecule with the rational selection of structures from the database than with random selection. Practically speaking, a compound library consisting of 1 million structures can be reduced to a few thousand structures by virtual screening with this optimized metric. This focused library is enriched with potential actives, i.e. structures whose pharmacophore is similar to that of the known actives.
The figures shown in the table are averages of 7 independent tests using the median hypothesis. The acceptance threshold was set in the optimization procedure so as to retrieve 80% of the actives in the training set.
25 Results: Improvement by optimization
Active set       Size   Euclidean   Optimized   Improvement ratio
5-HT3              12      12.55      239.24        49.26
ACE                89       1.42        6.50         4.64
Angiotensin2       10      27.81       85.45        11.15
Beta2              50       1.52       24.70        17.42
D2                 13      27.64      123.25        11.19
delta              20      11.66      243.57        69.11
Ftp                35      46.88       71.54         5.35
mGluR1             18       5.59      296.30        70.93
NPY-5             139       1.00        3.22         3.25
Thrombin            8       2.56        4.57         2.62
Optimization can improve the efficiency of virtual screening in many cases. This table summarizes the results of a 7-fold cross-validation. The descriptor was a pharmacophore fingerprint, and all variants of the Euclidean metric were optimized. The values in the Euclidean and Optimized columns (yellow and orange on the slide) are average enrichment ratios. The improvement is the ratio of the enrichment values after and before optimization. Note that this improvement ratio is not the ratio of the average enrichments but the average of the ratios of the individual enrichment values.
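The distinction drawn in the last sentence explains why the improvement column cannot be reproduced by dividing the two averaged columns (for 5-HT3, 239.24 / 12.55 is about 19, not 49.26). A small sketch with hypothetical per-fold enrichment values shows how the two quantities diverge:

```python
from statistics import mean

def mean_of_ratios(before, after):
    """Average of the per-test improvement ratios (the quantity reported in the table)."""
    return mean(a / b for a, b in zip(after, before))

def ratio_of_means(before, after):
    """Ratio of the average enrichments (not what the table reports)."""
    return mean(after) / mean(before)

# Hypothetical enrichment values from three independent tests.
before = [1.0, 2.0, 4.0]
after = [10.0, 4.0, 4.0]
print(mean_of_ratios(before, after))
print(ratio_of_means(before, after))
```

A single test with a low enrichment before optimization and a high one after it dominates the mean of ratios, which is why that figure can be much larger than the ratio of the averages.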
26 Results: Active Hit Distribution
- offers a more intuitive way to evaluate the efficiency of screening
- based on sorting random set hits and known actives by dissimilarity value and counting the number of random set hits that precede each active in the sorted list
Figure: sorted dissimilarity values 0.014, 0.015, 0.017, 0.020, 0.022, 0.023, 0.027, 0.041, 0.043; plot axes: number of virtual hits, number of actives.
Enrichment, though widely used, is not the most suitable measure of the efficiency of virtual screening:
- Its value is not bounded; there is no "best" enrichment.
- Its maximal possible value depends on the number of known actives.
- It is too sensitive to non-active hits: finding 2 actives and 0 inactives may give a much larger enrichment ratio than finding, say, all six actives and 3 inactives.
- Its actual value depends strongly on the threshold value used in virtual screening.
The distribution of the actives in the series of virtual hits provides both an intuitive visualization of the efficiency of virtual screening and an alternative measure to enrichment. Virtual hits are sorted by the dissimilarity values calculated during the screening process. In the ideal case all the actives come first in the sorted list. The more the actives are spread out within the sorted series, the worse the screening procedure has performed. The light blue dashed line indicates the theoretical optimum, in which the first n hits are all from the set of known actives. The closer the sample points are to this line, the better the screening has performed.
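The counting procedure behind the Active Hit Distribution can be sketched as follows. The dissimilarity values are the ones listed on the slide, but which of them belong to actives is not recoverable, so the split used here is a hypothetical assignment; ties are counted in favor of the active, which is also an assumption.

```python
import bisect

def active_hit_distribution(active_scores, random_scores):
    """For each active, in order of increasing dissimilarity, count the
    random-set hits with a strictly smaller dissimilarity value."""
    random_sorted = sorted(random_scores)
    return [bisect.bisect_left(random_sorted, score)
            for score in sorted(active_scores)]

# Hypothetical split of the slide's values into actives and random-set hits.
active_scores = [0.014, 0.020, 0.041]
random_scores = [0.015, 0.017, 0.022, 0.023, 0.027, 0.043]
counts = active_hit_distribution(active_scores, random_scores)
print(counts)
```

The ideal screening result corresponds to a list of zeros: every active is found before the first random-set hit. Plotting the k-th active against `counts[k] + k + 1` virtual hits gives the curve shown on the following slides.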
27 Results: ACE (pharmacophore similarity)
An Active Hit Distribution plot for the ACE inhibitor set. The diagram clearly shows that optimization improved the performance of the Euclidean metric by at least an order of magnitude.
28 Results: NPY-5 (pharmacophore similarity)
There are cases in which even basic metrics perform optimally. In our validation studies with the NPY-5 active set, the first seven hits were actives in all cases. However, retrieving more than 10 actives required a larger virtual hit set when basic metrics were used. Optimization led to an improvement of 1 to 2 orders of magnitude.
29 Results: β2-adrenoceptor (pharmacophore similarity)
This example demonstrates that optimization can achieve optimal behavior even when basic metrics perform poorly. This plot best illustrates the advantage of optimization in virtual screening.
30 Results: Structural or pharmacophore fingerprint?
Active set       Size   Chemical   Pharmacophore   Diversity*
5-HT3              12     692.21       239.24         0.30
ACE                89       4.29         6.50         0.56
Angiotensin2       10     190.76        85.45         0.40
Beta2              50      10.98        24.70         0.50
D2                 13     358.10       123.25       (missing)
delta              20     249.40       243.57         0.32
Ftp                35     575.16        71.54       (missing)
mGluR1             18     350.86       296.30         0.37
NPY-5             139       1.52         3.22         0.47
Thrombin            8       3.59         4.57         0.46
* Average 1 - Tanimoto coefficient over all pairs of compounds in the active set, based on the chemical fingerprint.
Generating the structural fingerprint is much faster than creating the pharmacophore fingerprint (in which pharmacophore perception takes a long time), and the structural fingerprint also needs significantly less storage. This naturally raises the question: is it worth using the more resource-intensive pharmacophore fingerprint? What benefits can it have, and under what circumstances?
The table lists the best mean enrichment ratio achieved with the topological (chemical) fingerprint and with the pharmacophore fingerprint. The "diversity" column is an estimate of the topological diversity of the corresponding active set. The table indicates that the costly pharmacophore fingerprint is beneficial when the diversity of the active set is over 0.45.
(Please note that these are preliminary results; a more thorough statistical analysis will be carried out later.)
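The diversity estimate defined in the table footnote can be sketched as a mean over all compound pairs. The fingerprints below are hypothetical 4-bit examples; real chemical fingerprints are much longer.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length bit sequences."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 if either == 0 else both / either

def diversity(fingerprints):
    """Average of 1 - Tanimoto over all pairs of fingerprints in the set."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are needed")
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

# Two identical compounds and one with no bits in common with them.
active_set = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
print(diversity(active_set))
```

A value of 0 means all actives have identical fingerprints; values approaching 1 mean the actives share almost no topological features.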
31 Results: Scaffold hopping
Scaffold hopping is particularly important when screening a database for novel drug candidates. In this example 30 ACE inhibitors were mixed into a library of drug-like molecules. Another ACE inhibitor, Captopril (left), was used as the query. The top 10 hits retrieved by an optimized scaled Tanimoto metric were all spikes (right). Four of them have a different scaffold (molecular backbone) from the query structure, while the other six are structural analogs.
The pharmacophore fingerprint was used in this experiment. By its nature, the chemical fingerprint can hardly achieve scaffold hopping.
32 Acknowledgements
Contributors:
- Nóra Máté
- Szilárd Dóránt
- Bernard Przybylski (Axovan)
The research was supported by [name not preserved in the transcript].
(Axovan is now part of Actelion.)
33 Bibliography
- J. Xu: GMA: A Generic Match Algorithm for Structural Homomorphism, Isomorphism, and Maximal Common Substructure Match and its Applications, J. Chem. Inf. Comput. Sci., 1996, 36, 1.
- L. Xue, F. L. Stahura, J. W. Godden, J. Bajorath: Fingerprint Scaling Increases the Probability of Identifying Molecules with Similar Activity in Virtual Screening Calculations, J. Chem. Inf. Comput. Sci., 2001, 41, 3.
- G. Schneider, W. Neidhart, T. Giller, G. Schmid: "Scaffold-Hopping" by Topological Pharmacophore Search: A Contribution to Virtual Screening, Angew. Chem. Int. Ed., 1999, 38, 19.
- D. Horvath: High Throughput Conformational Sampling and Fuzzy Similarity Metrics: A Novel Approach to Similarity Searching and Focused Combinatorial Library Design and its Role in the Drug Discovery Laboratory; manuscript.
- J. Bajorath: Virtual screening in drug discovery: Methods, expectations and reality.