Parallel tiered clustering for large data sets using a modified Taylor's algorithm
J. MacCuish 1, N. MacCuish 1, M. Chapman 1
1 Mesa Analytics & Computing, Inc., Santa Fe, New Mexico, USA

Abstract: Clustering large data sets has many applications in drug discovery, among them compound acquisition decisions and combinatorial library diversification. Molecular fingerprints (2D) and molecular shape conformers (3D) from PubChem are the basic descriptors comprising the large sets used in this study. A parallel tiered clustering algorithm, implementing a modified Taylor's algorithm, will be described as an efficient method for analyzing data sets of such large scale. Results will be presented in SAESAR (Shape And Electrostatics Structure Activity Relationships).

Motivation: Though leader and related exclusion-region clustering algorithms, such as Taylor/Butina clustering 1,2, are fast and can group millions of compound fingerprints in parallel, they suffer from the difficulty of finding an appropriate region threshold for the problem at hand. K-means clustering, also used for large-scale clustering, suffers an analogous problem in the choice of K. Finding an appropriate threshold or choice of K for the data can be very computationally expensive, above and beyond the expense of clustering millions of compounds.

Methods. Algorithm: We use Taylor's algorithm, modified to assign false singletons to their nearest respective cluster and to break exclusion-region ties (clusters with the greatest membership having the same cardinality) by choosing the most compact cluster. Additionally, the input to the algorithm is a sparse matrix composed of only those values that are reasonable dissimilarities. This helps on two counts: first, the generation of the matrix can take into account other efficiencies in eliminating unnecessary comparisons; second, the sparse matrix greatly reduces internal memory and disk storage, often by a large constant factor (e.g., 100 times). The algorithm returns clusters, their respective representative elements (centroids or centrotypes), true singletons, false-singleton cluster assignments, and ambiguity statistics.
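The exclusion-region selection at the heart of the modified algorithm can be sketched as follows. This is a minimal illustration over a neighbor-list view of the sparse matrix; the tie-breaking by compactness and the false-singleton reassignment are only noted in comments, and `taylor_cluster` and its inputs are hypothetical names, not the authors' API.

```python
def taylor_cluster(n, neighbors):
    """Minimal sketch of Taylor/Butina exclusion-region clustering.

    neighbors[i] is the set of items within the exclusion-region
    threshold of item i (i.e., one row of the sparse matrix).
    """
    unassigned = set(range(n))
    clusters = {}  # centrotype -> set of members
    while unassigned:
        # pick the item with the largest surviving exclusion region;
        # the modified algorithm breaks ties by cluster compactness
        center = max(unassigned, key=lambda i: len(neighbors[i] & unassigned))
        members = (neighbors[center] & unassigned) | {center}
        clusters[center] = members
        unassigned -= members
    # false singletons would then be reassigned to their nearest
    # respective cluster (omitted in this sketch)
    return clusters
```

Each pass removes a centrotype and its exclusion region, so the loop touches every item exactly once, which is what makes leader-style algorithms fast on millions of fingerprints.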

Tiered Taylor's: Taylor's algorithm can then be used as a base algorithm to iteratively span a set of regular thresholds, successively reducing the size of the sparse matrix used at each step. Namely: create a base sparse matrix M at some broad threshold (e.g., Tanimoto 0.7, or 0.3 dissimilarity); choose a minimum threshold T (e.g., Tanimoto 0.95), a step size S (e.g., 0.01), and a stopping threshold N. In principle, these matrices can hold dissimilarity values from any data. Here we focus on fingerprint and shape Tanimoto values transformed to the Soergel dissimilarity.
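The Soergel transformation and the broad-threshold sparsification can be illustrated with a small sketch; for binary fingerprints the Soergel dissimilarity is simply 1 minus the Tanimoto similarity. The function names and the set-of-on-bits fingerprint representation are illustrative assumptions, not the authors' implementation.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def sparse_soergel(fps, broad=0.3):
    """Build the sparse dissimilarity matrix: keep only pairs whose
    Soergel dissimilarity (1 - Tanimoto for binary fingerprints) is
    within the broad threshold; all other pairs are simply absent."""
    entries = {}
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            d = 1.0 - tanimoto(fps[i], fps[j])
            if d <= broad:
                entries[(i, j)] = d
    return entries
```

In practice the pair generation itself is done in parallel and with comparison-pruning tricks, which is where the large constant-factor savings over a dense matrix come from.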

Tiered Algorithm
Preprocess steps:
Create sparse matrix M (in parallel) at threshold N
Remove singletons
Input: M, T, N
1. Cluster in parallel with threshold = T.
2. Pool cluster representatives and singletons into set V.
3. Collect the matrix information for V from M and create a new M.
4. Calculate the mean of all internal cluster distances for Kelley's Level Selection, and output the number of singletons and the number of clusters.
5. Set T = T + S.
6. Repeat until T = N.
Compute the Kelley Level Selection values over the span of iterations, normalized for the size of the data at each iteration.
Output: Each iteration represents a clustering, and the full results represent a forest of trees, with the leaves containing the first cluster representatives and each level the results of successive iterations.
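The iteration above can be sketched as a small driver loop. Here `cluster_at` is a hypothetical stand-in for the modified Taylor step (returning representatives and true singletons at a threshold), and the per-level tuple layout is an assumption for illustration only.

```python
def tiered_taylor(matrix, items, cluster_at, t, step, stop):
    """Sketch of the tiered driver (steps 1-6 above): cluster at
    successively looser thresholds, carrying only the pooled
    representatives and singletons (set V) forward and shrinking the
    sparse matrix to match at each step."""
    levels = []
    while t <= stop + 1e-12:            # guard against float drift
        reps, singletons = cluster_at(matrix, items, t)
        survivors = reps | singletons   # set V of step 2
        # step 3: restrict the sparse matrix to the surviving items
        matrix = {pair: d for pair, d in matrix.items()
                  if pair[0] in survivors and pair[1] in survivors}
        levels.append((t, survivors))
        items = survivors
        t += step                       # step 5
    return levels
```

Because each level clusters only the previous level's representatives, both the item set and the sparse matrix shrink monotonically, which is what keeps 21 iterations cheaper than a single dense clustering.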

Data and Equipment
Data set: PubChem Kinase Data, FAK 3
96,881 compounds, with 811 actives
90,784 compounds after salts and charged compounds were removed
89,507 unique fingerprints (1.4% duplicates with Mesa 768-bit key fingerprints)
Equipment: single Alienware workstation with 4 gigabytes of RAM and four Intel 3.2 GHz Xeon cores, running a 64-bit SUSE Linux OS.

Timings
Parallel matrix generation for 89,507 fingerprints: 24 minutes for the sparse matrix at 0.3 dissimilarity, versus 43 minutes sequentially.
Parallel tiered clustering, including Kelley's level selection: 2.5 minutes from 0.1 to 0.3 dissimilarity with a 0.01 step size (21 iterations), versus 8 minutes sequentially. I/O times are included.
Largest single clustering to date, with proprietary data, is ~6.7 million compounds using MDL 320 MACCS key fingerprints: 3 weeks for the matrix generation on a 32-node cluster running 2.0 GHz chips, and 5 days for the clustering on a single machine with 12 cores of 3.2 GHz chips and 128 gigabytes of RAM.

Kelley's level selection
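For reference, Kelley-Gardner-Sutcliffe level selection trades cluster tightness against cluster count: the mean within-cluster spread at each level is rescaled onto [1, N-1] (N levels) and the number of clusters is added, and the level with the minimum penalty is preferred. A minimal sketch under that formulation (the per-level normalization for data size mentioned in the slides is not reproduced here):

```python
def kelley_penalty(avg_spreads, n_clusters):
    """Kelley-Gardner-Sutcliffe penalty per level: scale the mean
    within-cluster spreads onto [1, N-1] and add the cluster count.
    avg_spreads[i] and n_clusters[i] describe level i."""
    n = len(avg_spreads)
    lo, hi = min(avg_spreads), max(avg_spreads)
    span = (hi - lo) or 1.0
    scaled = [1 + (n - 2) * (s - lo) / span for s in avg_spreads]
    return [sc + k for sc, k in zip(scaled, n_clusters)]
```

The preferred level is then simply the index of the minimum penalty.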

Using the Tiers
Tiered output is a forest of general rooted trees. The nodes at each level are a set of centrotypes; the leaves are clusters or singletons containing all compounds. The height of the trees is the Tanimoto range: for example, 0.7 at the top to 0.9 at the bottom, with cluster centrotypes in the levels between.
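One hypothetical way to materialize that forest from the per-level outputs (tightest threshold first, each level mapping a centrotype to the set of items it absorbed) is sketched below; the dict-based node layout is an illustrative assumption, not the authors' data structure.

```python
def build_forest(levels):
    """Assemble the tiered-output forest. Each item's subtree from the
    previous (tighter) level becomes a child of the centrotype that
    absorbed it at the current (looser) level; a centrotype absorbs
    its own earlier node as well."""
    roots = {}                      # item -> subtree carried forward
    for depth, assignment in enumerate(levels):
        new_roots = {}
        for rep, members in assignment.items():
            children = [roots.pop(m) for m in members if m in roots]
            new_roots[rep] = {"item": rep, "level": depth, "children": children}
        new_roots.update(roots)     # survivors not regrouped carry forward
        roots = new_roots
    return list(roots.values())
```

Walking a root downward then replays the clustering from the broad threshold back to the tight one, which is what allows a level (e.g., one chosen by Kelley's selection) to be read off directly.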

Hierarchies. Kinase Data: centroids found at 0.76 Tanimoto with tiered clustering, clustered again with group-average hierarchical clustering.

Shape Cluster of Active Cluster

Acknowledgments
Software: OpenEye Scientific Software, Inc.; OpenBabel; Dalke Scientific, LLC
References
1. Taylor, R. Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals. J. Chem. Inf. Comput. Sci. 1995, 35.
2. Butina, D. Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J. Chem. Inf. Comput. Sci. 1999, 39(4).
3. PubChem: Primary biochemical high-throughput screening assay for inhibitors of Focal Adhesion Kinase (FAK).