Parallel tiered clustering for large data sets using a modified Taylor's algorithm
J. MacCuish 1, N. MacCuish 1, M. Chapman 1
1 Mesa Analytics & Computing, Inc., Santa Fe, New Mexico, USA
Abstract Clustering large data sets has many applications in drug discovery, among them compound acquisition decisions and combinatorial library diversification. Molecular fingerprints (2D) and molecular shape conformers (3D) from PubChem are the basic descriptors comprising the large sets used in this study. A parallel tiered clustering algorithm, implementing a modified Taylor's algorithm, will be described as an efficient method for analyzing data sets of this scale. Results will be presented in SAESAR (Shape And Electrostatics Structure Activity Relationships).
Motivation Leader and related exclusion region clustering algorithms, such as Taylor/Butina clustering [1,2], are fast and can group millions of compound fingerprints in parallel, but it is difficult to find an appropriate region threshold for the problem at hand. K-means clustering, also used for large scale clustering, suffers an analogous problem in the choice of K. Finding an appropriate threshold or choice of K for the data can be very computationally expensive, above and beyond the expense of clustering millions of compounds.
Methods Algorithm: We use Taylor's algorithm, modified in two ways: false singletons are assigned to their nearest respective cluster, and exclusion region ties (cases where the clusters with the greatest membership have the same cardinality) are broken by choosing the most compact cluster. Additionally, the input to the algorithm is a sparse matrix composed of only those values that are reasonable dissimilarities. This helps on two counts: first, the generation of the matrix can take advantage of other efficiencies in eliminating unnecessary comparisons; second, the sparse matrix greatly reduces internal memory and disk storage, often by a large constant factor (e.g., 100 times). The algorithm returns clusters, their respective representative elements (centroids or centrotypes), true singletons, false singleton cluster assignments, and ambiguity statistics.
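The modified exclusion region pass described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `taylor_cluster`, the dictionary-of-pairs input, the use of mean distance to live neighbors as the "compactness" measure, and the rule that a false singleton joins the cluster of its nearest already-assigned neighbor are all assumptions made for the example. Ambiguity statistics are not computed here.

```python
from collections import defaultdict


def taylor_cluster(n_items, sparse_pairs, threshold):
    """Hypothetical sketch of an exclusion-region (Taylor/Butina style) pass.

    sparse_pairs maps (i, j) -> dissimilarity and holds only pairs that
    already fall under some broad cutoff (the sparse matrix).
    Returns (clusters, true_singletons, false_singleton_assignments).
    """
    # Neighbor lists restricted to the requested threshold.
    nbrs = defaultdict(dict)
    for (i, j), d in sparse_pairs.items():
        if d <= threshold:
            nbrs[i][j] = d
            nbrs[j][i] = d

    # True singletons have no neighbor at all within the threshold.
    true_singletons = [i for i in range(n_items) if not nbrs[i]]
    unassigned = set(range(n_items)) - set(true_singletons)
    clusters = {}  # representative (centrotype) -> list of members

    while unassigned:
        best, best_key = None, None
        for i in sorted(unassigned):
            live = [j for j in nbrs[i] if j in unassigned]
            if not live:
                continue
            # Largest neighborhood wins; ties go to the most compact one
            # (assumed here: smallest mean distance to live neighbors).
            mean_d = sum(nbrs[i][j] for j in live) / len(live)
            key = (-len(live), mean_d)
            if best_key is None or key < best_key:
                best, best_key = i, key
        if best is None:
            break  # only false singletons remain
        members = [best] + sorted(j for j in nbrs[best] if j in unassigned)
        clusters[best] = members
        unassigned -= set(members)

    # False singletons: they had neighbors, but every neighbor was claimed
    # by another cluster. Attach each to its nearest neighbor's cluster.
    member_of = {m: rep for rep, ms in clusters.items() for m in ms}
    false_assignments = {}
    for i in sorted(unassigned):
        nearest = min(nbrs[i], key=lambda j: (nbrs[i][j], j))
        false_assignments[i] = member_of[nearest]
    for i, rep in false_assignments.items():
        clusters[rep].append(i)
    return clusters, true_singletons, false_assignments
```

For example, with pairs `{(0, 1): 0.1, (0, 2): 0.1, (0, 3): 0.1, (3, 4): 0.15, (4, 5): 0.5}` and a threshold of 0.2, item 0 becomes the representative, item 4 is a false singleton attached to that cluster through item 3, and item 5 is a true singleton.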
Tiered Taylor's Taylor's algorithm can then be used as a base algorithm to iteratively span a set of regular thresholds, successively reducing the size of the sparse matrix used at each step. Namely, create a base sparse matrix, M, at some broad threshold (e.g., Tanimoto 0.7, or 0.3 dissimilarity); choose a starting (tightest) threshold, T (e.g., Tanimoto 0.95, or 0.05 dissimilarity), a step size, S (e.g., 0.01), and a stopping threshold, N. In principle, these matrices can hold dissimilarity values from any data. Here we focus on fingerprint and shape Tanimoto values transformed to the Soergel dissimilarity.
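As a small illustration of the parameters above, the sketch below converts a Tanimoto similarity to a dissimilarity (for binary fingerprints the Soergel dissimilarity is 1 minus the Tanimoto similarity) and lists the thresholds a tiered run would visit. The function names `soergel` and `tier_thresholds` are hypothetical, and including the stopping threshold N as the last tier is an assumption consistent with the 21 iterations reported for the 0.1 to 0.3 span in the Timings slide.

```python
def soergel(tanimoto):
    """Dissimilarity corresponding to a Tanimoto similarity (1 - similarity)."""
    return 1.0 - tanimoto


def tier_thresholds(t_start, step, n_stop):
    """Dissimilarity thresholds from T up to and including N in steps of S."""
    count = int(round((n_stop - t_start) / step)) + 1
    # Rounding avoids accumulated floating-point drift such as 0.30000000000000004.
    return [round(t_start + k * step, 10) for k in range(count)]


# Tanimoto 0.95 corresponds to a dissimilarity of about 0.05.
start = soergel(0.95)
tiers = tier_thresholds(0.1, 0.01, 0.3)
print(len(tiers), tiers[0], tiers[-1])
```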
Tiered Algorithm
Preprocess steps:
Create sparse matrix M (in parallel) at the broad threshold N.
Remove singletons and let M be the reduced matrix.
Input: M, T, N
1. Cluster in parallel with threshold = T.
2. Pool cluster representatives and singletons into set V.
3. Collect matrix information for V from M and create the new M.
4. Calculate the mean of all internal cluster distances for Kelley's level selection, and output the number of singletons and the number of clusters.
5. Set T = T + S.
6. Repeat until T = N.
Compute Kelley level selection values over the span of iterations, normalized for the size of the data at each iteration.
Output: Each iteration represents a clustering, and the full results represent a forest of trees, with the leaves containing the first cluster representatives and each level holding the results of successive iterations.
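The loop above might look roughly like the following sequential sketch. It is illustrative only: `cluster_once` is a deliberately simplified exclusion-region pass (no compactness tie-break, and singletons are kept as one-member clusters), the parallel steps are omitted, and the mean representative-to-member distance stands in for "the mean of all internal cluster distances". All function and key names are hypothetical.

```python
def cluster_once(items, pairs, threshold):
    """Simplified exclusion-region pass: {representative: [members]}."""
    nbrs = {i: {} for i in items}
    for (i, j), d in pairs.items():
        if d <= threshold and i in nbrs and j in nbrs:
            nbrs[i][j] = d
            nbrs[j][i] = d
    unassigned = set(items)
    clusters = {}
    while unassigned:
        # Largest live neighborhood first; ties go to the lowest id.
        rep = min(sorted(unassigned),
                  key=lambda i: -sum(j in unassigned for j in nbrs[i]))
        members = [rep] + sorted(j for j in nbrs[rep] if j in unassigned)
        clusters[rep] = members
        unassigned -= set(members)
    return clusters


def tiered_clustering(n_items, pairs, t_start, step, n_stop):
    """Run one clustering per tier, shrinking the sparse matrix each time."""
    items = list(range(n_items))
    levels = []
    n_levels = int(round((n_stop - t_start) / step)) + 1
    for k in range(n_levels):
        threshold = round(t_start + k * step, 10)
        clusters = cluster_once(items, pairs, threshold)  # step 1

        # Step 4: statistics later needed for level selection.
        spread = [pairs.get((rep, m), pairs.get((m, rep)))
                  for rep, ms in clusters.items() for m in ms[1:]]
        n_single = sum(len(ms) == 1 for ms in clusters.values())
        levels.append({
            "threshold": threshold,
            "clusters": clusters,
            "n_clusters": len(clusters) - n_single,
            "n_singletons": n_single,
            "mean_internal": sum(spread) / len(spread) if spread else 0.0,
        })

        # Steps 2 and 3: pool representatives and singletons into V,
        # then keep only the entries of M that involve members of V.
        items = sorted(clusters)
        keep = set(items)
        pairs = {(i, j): d for (i, j), d in pairs.items()
                 if i in keep and j in keep}
    return levels
```

Each entry of the returned list is one level of the forest: the representatives at one level are the items clustered at the next, so following a representative upward traces a path from a leaf cluster toward a root.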
Data and Equipment
Data set: PubChem Kinase Data, FAK [3]
96,881 compounds, with 811 actives
90,784 compounds with salts and charged compounds removed
89,507 unique fingerprints (1.4% duplicates with Mesa 768 key bit fingerprints)
Equipment: Single Alienware workstation with 4 gigabytes of RAM and 4 Intel 3.2 gigahertz XEON cores, running Suse bit OS.
Timings
Parallel matrix generation: 24 minutes for the sparse matrix at 0.3 dissimilarity over 89,507 fingerprints, versus 43 minutes sequentially.
Parallel tiered clustering, including Kelley's level selection: 2.5 minutes from 0.1 to 0.3 dissimilarity with a 0.01 step size (21 iterations), versus 8 minutes sequentially.
IO times are included.
Largest single clustering to date, with proprietary data, is ~6.7 million compounds using MDL 320 MACCS Key fingerprints: 3 weeks on a 32 node cluster running 2.0 gigahertz chips for the matrix generation, and 5 days on a single machine with 12 cores of 3.2 gigahertz chips and 128 gigabytes of RAM for the clustering.
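For reference, the speedups implied by the workstation timings can be worked out directly. The sketch below only does that arithmetic; the helper name `speedup` is hypothetical, and the efficiency figure assumes all 4 cores listed on the Data and Equipment slide were used.

```python
def speedup(sequential_minutes, parallel_minutes, cores):
    """Return (speedup, parallel efficiency) for one pair of timings."""
    s = sequential_minutes / parallel_minutes
    return s, s / cores


# Matrix generation: 43 minutes sequential versus 24 minutes parallel.
matrix_speedup, matrix_eff = speedup(43, 24, 4)
# Tiered clustering: 8 minutes sequential versus 2.5 minutes parallel.
cluster_speedup, cluster_eff = speedup(8, 2.5, 4)
print(round(matrix_speedup, 2), round(matrix_eff, 2))
print(round(cluster_speedup, 2), round(cluster_eff, 2))
```

On these numbers the matrix generation runs a little under twice as fast in parallel, while the tiered clustering gains a bit more than a factor of three.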
Kelley's level selection
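As a rough guide to what a level selection computes, the sketch below uses the commonly cited form of the Kelley penalty: the mean cluster spread at each level is rescaled to the range 1 to n - 1 and added to the number of clusters at that level, and the level with the smallest penalty is preferred. This is an assumption about the general technique, not the presenters' variant, which is additionally normalized for the size of the data at each iteration; the function name and inputs are hypothetical.

```python
def kelley_penalties(mean_spreads, n_clusters, n_items):
    """Penalty per level: rescaled mean spread plus number of clusters.

    mean_spreads[k] is the mean internal cluster distance at level k,
    n_clusters[k] the number of clusters there, n_items the data size.
    """
    lo, hi = min(mean_spreads), max(mean_spreads)
    span = (hi - lo) or 1.0  # guard against identical spreads at every level
    return [(n_items - 2) * (s - lo) / span + 1 + k
            for s, k in zip(mean_spreads, n_clusters)]


# Three hypothetical levels of a 12-item clustering.
penalties = kelley_penalties([0.0, 0.05, 0.2], [10, 4, 1], 12)
best_level = penalties.index(min(penalties))
print(best_level)
```

The penalty trades off tight clusters (small spread, many clusters) against few clusters (large spread), so the minimum tends to fall at an intermediate level.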
Using the Tiers
Tiered output is a forest of general rooted trees.
Nodes at each level are a set of centrotypes.
Leaves are clusters or singletons containing all compounds.
Height of the trees is the Tanimoto range: for example, 0.7 at the top to 0.9 at the bottom.
[Figure: forest of trees, with cluster centrotypes in levels of …]
Hierarchies. Kinase Data: Centroids found at 0.76 Tanimoto with tiered clustering, clustered again with group average hierarchical clustering
Shape Cluster of Active Cluster
Acknowledgments
Software: OpenEye Scientific Software, Inc.; OpenBabel; Dalke Scientific, LLC
References
1. Taylor, R. Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals. J. Chem. Inf. Comput. Sci. 1995, 35.
2. Butina, D. Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J. Chem. Inf. Comput. Sci. 1999, 39(4).
3. PubChem: Primary biochemical high-throughput screening assay for inhibitors of Focal Adhesion Kinase (FAK). http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay&term=Kinase