Selecting Diverse Sets of Compounds C371 Fall 2004.



Review
Similar Property Principle: structurally similar compounds are likely to exhibit similar activity
If that holds, maximum coverage of the activity space should be achieved by selecting a structurally diverse set of compounds

Techniques
High-Throughput Screening (HTS)
Combinatorial Chemistry
Early attempts led to large libraries, but little variability in the molecules created
Need a way to identify subsets of compounds for synthesis, purchase, or testing

Chemical Diversity
No unambiguous definition
Need to quantify the degree of diversity of a subset of compounds
Four main approaches:
  – Cluster analysis
  – Dissimilarity-based methods
  – Cell-based methods
  – Use of optimization techniques

CLUSTER ANALYSIS
Aim is to divide a set of compounds into clusters such that objects within a cluster are similar and objects in different clusters are dissimilar
Many algorithms for doing this
  – Hierarchical methods seem to perform better than non-hierarchical ones
Sometimes called a “distance-based” approach to compound selection, because distance is measured between pairs of compounds

Key Steps in Cluster Analysis
Generate descriptors for each compound
Calculate the similarity or distance between all pairs of compounds
Use a clustering algorithm to group the compounds
Select a representative subset by taking one or more compounds from each cluster

“Distance”
1 - S, where S is the similarity coefficient
  – When molecules are represented by binary descriptors
Euclidean distance
  – When molecules are represented by physicochemical properties
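The two distance definitions above can be sketched in a few lines of Python. This is only an illustration: the slide does not name the similarity coefficient S, so the Tanimoto coefficient (mentioned later in the deck) is assumed here, fingerprints are assumed to be sets of "on" bit positions, and all function names are hypothetical.

```python
import math


def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto similarity S of two fingerprints given as sets of 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / union


def tanimoto_distance(bits_a, bits_b):
    """Distance 1 - S for binary descriptors."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


def euclidean_distance(props_a, props_b):
    """Euclidean distance for vectors of physicochemical properties."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(props_a, props_b)))


# Two fingerprints sharing 2 of 4 distinct bits: S = 0.5, distance = 0.5.
print(tanimoto_distance({1, 2, 3}, {2, 3, 4}))
# Two property vectors that differ by (3, 4): distance = 5.0.
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
```

In practice the property vectors would normally be scaled first so that no single property dominates the Euclidean distance; that step is omitted here.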

Characteristics of Clustering Methods
Non-overlapping: each object belongs to one cluster only (most methods use this approach)
  – Hierarchical methods
  – Non-hierarchical methods
Overlapping: an object can be in more than one cluster
Efficiency and effectiveness issues: some approaches have very intensive computational requirements

Hierarchical Clustering
Clusters increase in size up the hierarchy, from each compound in its own cluster (a singleton) at one extreme to all compounds in a single cluster at the other
  – Agglomerative methods start at the bottom (singletons) and merge similar clusters
      Ward’s method: clusters are formed so as to minimize the variance (i.e., the sum of the squared deviations from the mean)
      Others: centroid method and median method
  – Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data
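A minimal sketch of agglomerative clustering with Ward's criterion follows. It assumes compounds are given as tuples of numeric properties and uses the standard expression for the increase in the within-cluster sum of squares when two clusters are merged, |A||B| / (|A| + |B|) times the squared distance between their centroids. The function names are hypothetical, and the naive search over all cluster pairs is meant for illustration, not for large libraries.

```python
def centroid(points):
    """Mean position of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def ward_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    squared_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(cluster_a), len(cluster_b)
    return na * nb / (na + nb) * squared_dist


def ward_clustering(points, n_clusters):
    """Merge singletons bottom-up until n_clusters clusters remain."""
    clusters = [[p] for p in points]  # start with every compound as a singleton
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the cheapest pair
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(ward_clustering(points, 2))
```

Stopping at `n_clusters` corresponds to cutting the hierarchy at one level; recording every merge instead would give the full dendrogram.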

Selecting the Appropriate Number of Clusters
Need a cutoff value at which you are going to examine the molecules
  – Jaccard statistic of two clusters, $C_1$ and $C_2$: $J(C_1, C_2) = \frac{a}{a + b + c}$, where a is the number of compounds found in both clusters, b is the number found in cluster 1 but not cluster 2, and c is the number found in cluster 2 but not cluster 1
  – Same as the Tanimoto coefficient
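As a small worked example of the Jaccard statistic a / (a + b + c), the sketch below compares two clusters given as collections of compound identifiers. The function name and the identifiers are made up for illustration.

```python
def jaccard_statistic(cluster_1, cluster_2):
    """Jaccard statistic a / (a + b + c) for two clusters of compound IDs."""
    c1, c2 = set(cluster_1), set(cluster_2)
    a = len(c1 & c2)  # compounds found in both clusters
    b = len(c1 - c2)  # compounds in cluster 1 but not cluster 2
    c = len(c2 - c1)  # compounds in cluster 2 but not cluster 1
    if a + b + c == 0:
        raise ValueError("both clusters are empty")
    return a / (a + b + c)


# a = 2 (m2, m3), b = 1 (m1), c = 2 (m4, m5), so the statistic is 2 / 5.
print(jaccard_statistic({"m1", "m2", "m3"}, {"m2", "m3", "m4", "m5"}))
```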

Non-Hierarchical Clustering
Compounds are clustered without forming a hierarchical relationship
Methods:
  – Single-pass: assigns each compound to a cluster according to a cut-off value
      Problem: results are not reproducible, because they depend on the order in which the molecules are processed
  – Nearest neighbor: Jarvis-Patrick clustering
  – Relocation: K-means
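The order dependence of single-pass clustering is easy to demonstrate. The sketch below is a simplified "leader" variant, assumed here for brevity: each compound joins the first cluster whose first member lies within the cut-off, otherwise it starts a new cluster. Compounds are reduced to single numbers so that the distance is just an absolute difference; real implementations typically compare against a cluster centroid and use fingerprint distances.

```python
def single_pass_clusters(values, cutoff):
    """Leader-style single-pass clustering of 1-D values."""
    clusters = []
    for v in values:
        for cluster in clusters:
            # Compare against the cluster's first member (its "leader").
            if abs(v - cluster[0]) <= cutoff:
                cluster.append(v)
                break
        else:
            clusters.append([v])  # no cluster close enough: start a new one
    return clusters


# Same three compounds, same cut-off, different processing order.
print(single_pass_clusters([0.0, 1.0, 2.0], cutoff=1.0))  # [[0.0, 1.0], [2.0]]
print(single_pass_clusters([1.0, 0.0, 2.0], cutoff=1.0))  # [[1.0, 0.0, 2.0]]
```

With 0.0 processed first, 2.0 is too far from the leader and forms its own cluster; with 1.0 processed first, everything falls within the cut-off of the leader.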

DISSIMILARITY-BASED SELECTION METHODS
Attempt to identify a diverse set of compounds directly
Based on calculating distances or dissimilarities between compounds

Basic Algorithm for Dissimilarity-Based Selection Methods
Decide on a desired size, n, of the final subset
Select a compound and place it in the subset
Calculate the dissimilarity between each of the other compounds and those in the subset
Choose the next compound as the one most dissimilar to those in the subset
If there are fewer than n compounds in the subset, repeat the dissimilarity calculation and selection until n is reached
Complexity varies as the square of n
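The steps above can be sketched as follows. The slide does not say how "most dissimilar to those in the subset" is measured; this sketch assumes the MaxMin definition, where a candidate's score is its distance to the nearest compound already selected. The function name, the `seed_index` parameter, and the toy data are illustrative only.

```python
def maxmin_select(compounds, n, distance, seed_index=0):
    """Pick up to n compounds, each time taking the candidate whose nearest
    already-selected compound is farthest away (MaxMin, assumed here)."""
    selected = [seed_index]
    # Distance from every compound to its nearest selected compound so far.
    min_dist = [distance(c, compounds[seed_index]) for c in compounds]
    while len(selected) < min(n, len(compounds)):
        candidates = (i for i in range(len(compounds)) if i not in selected)
        next_i = max(candidates, key=lambda i: min_dist[i])
        selected.append(next_i)
        # Only the newly added compound can lower the nearest-neighbor distances.
        for i, c in enumerate(compounds):
            min_dist[i] = min(min_dist[i], distance(c, compounds[next_i]))
    return [compounds[i] for i in selected]


# Toy 1-D "compounds"; the distance is the absolute difference.
picked = maxmin_select([0, 1, 2, 10, 11], n=3, distance=lambda a, b: abs(a - b))
print(picked)  # [0, 11, 2]
```

Caching `min_dist` means each round needs one distance calculation per compound, rather than recomputing distances to every selected compound.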

CELL-BASED METHODS
Operate within a pre-defined low-dimensional chemistry space that does not depend on the particular set of molecules being examined
Compounds are allocated to cells according to their molecular properties
  – Example properties: molecular weight (MW), logP, polarity, shape, hydrogen bonding, aromatic interactions
Methods are very fast, with a time complexity of O(N), but are restricted to low-dimensional spaces
  – Good for very large data sets
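A rough sketch of cell-based partitioning: each property axis is cut into equal-width bins over a fixed range, a compound's cell is the tuple of its bin indices, and one compound is kept per occupied cell. The property ranges, bin count, compound names, and values below are invented for illustration.

```python
def cell_index(properties, bounds, bins_per_axis):
    """Map a property vector to a tuple of bin indices, one per axis."""
    cell = []
    for value, (low, high) in zip(properties, bounds):
        width = (high - low) / bins_per_axis
        k = int((value - low) // width)
        # Clamp values that fall outside the pre-defined range to the edge bins.
        cell.append(min(max(k, 0), bins_per_axis - 1))
    return tuple(cell)


def select_one_per_cell(compounds, bounds, bins_per_axis):
    """Single O(N) pass: keep the first compound seen in each occupied cell."""
    chosen = {}
    for name, props in compounds:
        chosen.setdefault(cell_index(props, bounds, bins_per_axis), name)
    return chosen


# Hypothetical (MW, logP) values; axes span 0-500 and -2 to 6 with 4 bins each.
compounds = [("a", (150.0, 1.0)), ("b", (180.0, 1.5)), ("c", (420.0, 4.5))]
bounds = [(0.0, 500.0), (-2.0, 6.0)]
print(select_one_per_cell(compounds, bounds, bins_per_axis=4))
# {(1, 1): 'a', (3, 3): 'c'}
```

Because the cell boundaries are fixed in advance, new compounds can be assigned to cells without reprocessing the existing ones, and empty cells show which regions of the space are not covered.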

BCUT Descriptors
Matrix representation of molecules
Atomic properties used for the diagonal
  – Atomic charges, polarizabilities, hydrogen bonding
Connectivity used for the off-diagonals
  – 2D graph or interatomic distances from 3D

Partitioning Using Pharmacophore Keys
Each potential 3- or 4-point pharmacophore is considered to constitute a cell
A given molecule could be in more than one cell
Promiscuous molecules: those that contain a large number of pharmacophores, e.g., very flexible molecules

OPTIMIZATION METHODS
Techniques for sampling large sets of molecules
May want to spread the compounds evenly in space
Techniques: Monte Carlo, simulated annealing
Selective replacement
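One way to picture an optimization-based selection is a simulated annealing loop that swaps compounds in and out of a subset. Everything specific here is an assumption made for illustration: the diversity score (sum of pairwise distances), the linear cooling schedule, the parameter names, and the toy 1-D data. The slides only name the general techniques.

```python
import math
import random


def diversity(subset, distance):
    """Assumed score: sum of all pairwise distances within the subset."""
    return sum(distance(a, b)
               for i, a in enumerate(subset)
               for b in subset[i + 1:])


def anneal_subset(compounds, n, distance, steps=2000, start_temp=1.0, seed=0):
    """Search for a diverse subset of size n by random single-compound swaps."""
    rng = random.Random(seed)
    current = rng.sample(compounds, n)
    score = diversity(current, distance)
    best, best_score = list(current), score
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9  # linear cooling
        outside = [c for c in compounds if c not in current]
        if not outside:
            break
        trial = list(current)
        trial[rng.randrange(n)] = rng.choice(outside)  # replace one member
        delta = diversity(trial, distance) - score
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, score = trial, score + delta
            if score > best_score:
                best, best_score = list(current), score
    return best, best_score


compounds = [float(x) for x in range(20)]
subset, score = anneal_subset(compounds, 3, lambda a, b: abs(a - b))
print(subset, score)
```

Accepting some worsening swaps at high temperature is what distinguishes annealing from a purely greedy replacement scheme, which can get stuck in a locally good but not very diverse subset.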

CONCLUSIONS
Some research suggests that compounds with a Tanimoto similarity of 0.85 or higher have between a 30% and an 80% chance of sharing the same biological activity
No clear consensus on which screening approach is best
Faster computing techniques (e.g., parallel computing) may help
Descriptors used must be related to biological activity