© 2001, Boehringer, Inc. - All Rights Reserved.
SCA: New Cluster Algorithm for Structural Diversity Analysis and Applications
Presented at Spotfire Users Conference
Jun Xu
Boehringer Ingelheim Pharmaceuticals, Inc.
May 3, 2001

Introduction: Diversity & Drug Design
Lead Screening
– Select compounds for UHTS
– Select compounds for acquisition
Combinatorial Library Design
– Compare virtual libraries
– Compare virtual libraries against existing inventory
– Select sub-library to make

Importance of Data Visualization
Graphically review structural diversity
Graphically filter unwanted compounds
Graphically select a sub-set
Graphically study the relations between structure and activity

Challenge!
Chemical structures are graphs
The number of compounds in a library can be very large

Solutions for studying the diversity of a large compound library: conventional methods

Mapping
Principal Component Analysis (PCA)
– Transform a matrix M(m,n) to M’(m,n’)
– The n’ dimensions are sorted by their eigenvalues
– If the top three dimensions explain >85% of the variance in the data, M’(m,3) is a fair approximation of M(m,n); otherwise PCA cannot be used for mapping
Multi-Dimensional Scaling (MDS)
– Based on a distance matrix
– Converts M(m,n) to M’(m,2) using an irrational method
One of the Problems
– The new dimensions have no chemical/physical meaning
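The 85% rule of thumb above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and that "explain" refers to the fraction of total variance carried by the leading eigenvalues of the covariance matrix; the function names and the default threshold argument are illustrative and do not come from the presentation.

```python
import numpy as np

def pca_explained_variance(M):
    """Return (variance fraction per component, projected data M').

    Components are sorted by decreasing eigenvalue of the covariance matrix.
    """
    X = M - M.mean(axis=0)                   # centre each column (descriptor)
    cov = np.cov(X, rowvar=False)            # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort to descending order
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    ratios = eigvals / eigvals.sum()
    return ratios, X @ eigvecs

def can_map_in_3d(M, threshold=0.85):
    """True if the top three components explain more than `threshold`."""
    ratios, _ = pca_explained_variance(M)
    return bool(ratios[:3].sum() > threshold)
```

With data that really lies close to a three-dimensional subspace the check passes; with descriptors of roughly equal, independent variance it fails, which is the case where the slide says PCA cannot be used for mapping.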

An example of mapping

Clustering
To divide n objects into m bins (n ≥ m)
Clustering is pattern recognition
Clustering can be unsupervised learning

General steps for clustering
Select the data that describe the objects
Extract patterns from the data
– normalizing rows
– normalizing columns
– normalizing methods
Measure similarity
Select a proper and robust clustering method

Problems in conventional methods
Selecting and computing “correct” descriptors is difficult and time-consuming
Hierarchical algorithms force “dogs” and “cats” to be together
Non-hierarchical algorithms ask for the “number of clusters” and other settings
The SOM method asks you to set at least eight irrational parameters

How many do you want? How many clusters are in my library?... K-means cluster:

K-means and K-nearest Neighbor Approaches
Assuming the number of clusters is known
Computing complexity: [equation not recoverable from the transcript]
– N_j represents the number of jth combinations in k clusters (groups)
– n represents the number of objects
– n_i represents the number of objects in the ith cluster
– k represents the number of clusters
It is an NP-complete problem
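The complexity expression from this slide did not survive in the transcript, so the following is not a reconstruction of it. As an independent illustration of why exhaustive search over clusterings is infeasible, the sketch below counts the ways to split n labelled objects into k non-empty clusters, which is the Stirling number of the second kind. The function name is an assumption made for this example.

```python
from math import comb, factorial

def count_partitions(n, k):
    """Ways to split n labelled objects into k non-empty, unlabelled clusters.

    Uses the inclusion-exclusion formula for Stirling numbers of the
    second kind: S(n, k) = (1/k!) * sum_j (-1)^j * C(k, j) * (k - j)^n.
    """
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

# Four objects can be split into two clusters in 7 ways;
# ten objects into three clusters in 9,330 ways.
print(count_partitions(4, 2), count_partitions(10, 3))
```

The count grows roughly like k^n / k!, so even a library of a few hundred compounds has far too many candidate partitions to enumerate.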

Self-Organizing Map (SOM) Approach
To run SOM, 8 parameters have to be set up properly, as follows:
– Data initialization: random or ordered
– Neighborhood function: bubble or Gaussian
– Neuron topology: hexagonal or rectangular
– Neural dimensions: X and Y (how many cells/neurons)
– Number of training steps: such as 10,000
– Initial learning rate: such as 0.03
– Initial radius of training area: such as 10
– Monitoring parameter: number of steps for generating 2D points on a plane, such as 100

S-Cluster: New approach
No need to compute descriptors
No need to give the number of clusters
Faster
Rational parameters
Results are explained chemically

S-Cluster Algorithm (1)
Extract scaffolds
Reference scaffold ($S_v$):
– number of rings in the smallest set of smallest rings (sssrs)
– number of non-H atoms (atoms)
– number of bonds, excluding bonds to H atoms (bonds)
– sum of non-H atomic numbers (zs)
– $V_v = \{\mathrm{sssrs}, \mathrm{atoms}, \mathrm{bonds}, \mathrm{zs}\}_{S_v}$
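The four-component vector can be computed from a hydrogen-suppressed molecular graph without a chemistry toolkit. The sketch below is an illustration under stated assumptions, not the author's implementation: the input format (a list of atomic numbers plus a list of atom-index pairs) and all names are hypothetical, and the ring count uses the standard identity that the size of the smallest set of smallest rings equals bonds - atoms + connected components. It does not extract the scaffold itself.

```python
from typing import NamedTuple

class ScaffoldVector(NamedTuple):
    sssrs: int  # number of rings in the smallest set of smallest rings
    atoms: int  # number of non-H atoms
    bonds: int  # number of bonds between non-H atoms
    zs: int     # sum of non-H atomic numbers

def count_components(n_atoms, bonds):
    """Count connected components with a small union-find."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_atoms)})

def scaffold_vector(atomic_numbers, bonds):
    """Build {sssrs, atoms, bonds, zs} for a hydrogen-suppressed graph."""
    n_atoms = len(atomic_numbers)
    n_bonds = len(bonds)
    sssrs = n_bonds - n_atoms + count_components(n_atoms, bonds)
    return ScaffoldVector(sssrs, n_atoms, n_bonds, sum(atomic_numbers))

# Benzene as a six-membered carbon ring (bond orders are not needed here).
benzene = scaffold_vector([6] * 6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
print(benzene)  # ScaffoldVector(sssrs=1, atoms=6, bonds=6, zs=36)
```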

Deriving Scaffolds

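S-Cluster Algorithm (2)
The complexity of a structure:
$V_i = \{\mathrm{sssrs}, \mathrm{atoms}, \mathrm{bonds}, \mathrm{zs}\}_{S_i}$ for structure $S_i$
$V_v = \{\mathrm{sssrs}, \mathrm{atoms}, \mathrm{bonds}, \mathrm{zs}\}$ from a reference scaffold
$P_i = \lVert V_v + V_i \rVert$
$M_i = \lVert V_v - V_i \rVert$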
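As a worked example of the two quantities above, the sketch below evaluates the norm of the sum and the norm of the difference for a pair of vectors. The transcript does not say which norm is meant or how the two values are combined into a final complexity score, so the Euclidean norm is an assumption here, as are the function names and the choice of benzene as the reference scaffold.

```python
import math

def norm(v):
    """Euclidean length of a vector (assumed; the slide does not name the norm)."""
    return math.sqrt(sum(x * x for x in v))

def complexity_terms(v_ref, v_i):
    """Return (P_i, M_i) = (||V_v + V_i||, ||V_v - V_i||)."""
    p_i = norm([a + b for a, b in zip(v_ref, v_i)])
    m_i = norm([a - b for a, b in zip(v_ref, v_i)])
    return p_i, m_i

# {sssrs, atoms, bonds, zs} for benzene (reference) and naphthalene.
v_ref = (1, 6, 6, 36)
v_naphthalene = (2, 10, 11, 60)
p_i, m_i = complexity_terms(v_ref, v_naphthalene)
print(round(p_i, 2), round(m_i, 2))  # 98.84 24.86
```

A structure identical to the reference gives M_i = 0, and M_i grows as the structure departs from the reference in ring count, size, or composition.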

S-Cluster Algorithm (3)
The “Cyclicity” of a structure
– The sum of heavy atomic numbers (a)
– The number of rotatable bonds (r)
– The number of 1-degree nodes (d1)
– The number of double bonds (db)
– The number of triple bonds (tb)
– The number of 2-degree nodes (d2)
– $V_s = \{a, r, d1, db, tb, d2\}_{\mathrm{scaffold}}$
– $V_i = \{a, r, d1, db, tb, d2\}_{\mathrm{structure}(i)}$
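The six descriptors can be counted directly on a hydrogen-suppressed graph. The sketch below only builds the descriptor vector; how the presentation turns the scaffold and structure vectors into a single cyclicity value is not given in the transcript. The input format (atomic numbers plus (atom, atom, bond order) triples) and all names are hypothetical, and the rotatable-bond rule used here (a single, non-ring bond between two non-terminal heavy atoms) is a common simplification that may differ from the author's definition.

```python
from collections import deque
from typing import NamedTuple

class CyclicityVector(NamedTuple):
    a: int   # sum of heavy-atom atomic numbers
    r: int   # rotatable bonds (simplified definition, see lead-in)
    d1: int  # nodes of degree 1
    db: int  # double bonds
    tb: int  # triple bonds
    d2: int  # nodes of degree 2

def _connected_without(n_atoms, bonds, skip, start, goal):
    """Breadth-first search from start to goal, ignoring bond number `skip`."""
    adjacency = [[] for _ in range(n_atoms)]
    for index, (u, v, _order) in enumerate(bonds):
        if index != skip:
            adjacency[u].append(v)
            adjacency[v].append(u)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False

def cyclicity_vector(atomic_numbers, bonds):
    """Build {a, r, d1, db, tb, d2} from atoms and (u, v, order) bonds."""
    n_atoms = len(atomic_numbers)
    degree = [0] * n_atoms
    for u, v, _order in bonds:
        degree[u] += 1
        degree[v] += 1
    rotatable = 0
    for index, (u, v, order) in enumerate(bonds):
        non_terminal = degree[u] > 1 and degree[v] > 1
        # A bond lies in a ring if its ends stay connected once it is removed.
        in_ring = _connected_without(n_atoms, bonds, index, u, v)
        if order == 1 and non_terminal and not in_ring:
            rotatable += 1
    return CyclicityVector(
        a=sum(atomic_numbers),
        r=rotatable,
        d1=degree.count(1),
        db=sum(1 for bond in bonds if bond[2] == 2),
        tb=sum(1 for bond in bonds if bond[2] == 3),
        d2=degree.count(2),
    )

# n-Butane: a four-carbon chain with one rotatable central bond.
print(cyclicity_vector([6] * 4, [(0, 1, 1), (1, 2, 1), (2, 3, 1)]))
```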

Results and discussions
Cluster the following libraries together:
– ACD (250,468 structures)
– NCI (126,554 structures, MDL 1994)
– CMC (4,591 oral drugs)
– MDDR (6,347 launched or pre-clinical drugs or compounds)
Cluster all 387,960 structures on an NT laptop (Compaq Armada E700)
Running time: 1 h 42 min

Cyclicity vs Complexity

The most complicated structure is at the upper right

The most chain-like structure is at the bottom left

Zoom-in: Substituent Patterns

Diversity “Island” and “Density”
A: Single O substituents
B: Single F substituents

“Cyclicity” vs Average Electronegativity

“Cyclicity” vs H-Bond Donors

Reagent Selector (R) clustering result (Jarvis-Patrick method): input 116 compounds, ask for 26 clusters
This is cluster 2

Result from the S-Cluster Algorithm: input 116 compounds, 26 clusters were found
This is cluster 2

Applications
Evaluate libraries
Compare libraries
Design a focused library

Blue: Virtual Library
Red: Target Library

The optimized sub-library to be made from the virtual library

But if you still want to cluster molecules (genes or small molecules) based upon their property/activity arrays... We have V-Cluster (Vector Cluster Algorithm) for this requirement; it will be presented later

Conclusions
We emphasize finding natural clusters
There must be chemical/physical explanations for computational results
Before a software “button” is pushed, the mathematical/chemical/physical/biological meaning should be understood
A good algorithm should be robust

Acknowledgements
Cheminformatics/Medicinal Chemistry
– Dr. Qiang Zhang
– Dr. Hans Briem
– Dr. Ron Magolda
