Kristin P. Bennett Dept of Mathematical Sciences and Dept of Computer Sciences Rensselaer Polytechnic Institute www.rpi.edu/~bennek.

Science in the 21st Century
Traditional: Hypothesis, Design, Experiment, Data, Data analysis, Result
Data driven: Process/Experiment, Data, No prior hypothesis
“If your experiment needs statistics, you ought to have done a better experiment.” Ernest Rutherford

Public Health Challenges
Drug Design: Predict bio-activities of small molecules. Subtask: Drug metabolism.
Control of Infectious Diseases: Use pathogen DNA fingerprinting to track and control disease. Subtask: Tuberculosis.

CYP450-mediated metabolism of drug-like molecules
Charles Bergeron, Jed Zaretzki, Curt Breneman, and Kristin Bennett
NIH Molecular Roadmaps Initiative 1P20H6003899-01
Motivation, Identify the problem, Customized Machine Learning Method, Results, Conclusions

Drug Metabolism
The rate-limiting step in the metabolism of drugs by the cytochrome P450 enzyme CYP3A4 is hydrogen atom abstraction (removal).
(Figures: clozapine molecule; Clozaril pill)

Motivation: Why is this important?
CYP450 isozymes metabolize the majority of drugs in clinical use.
–3A4, 2D6, and 2C9 metabolize 50%, 25%, and 16% of drugs on the market, respectively.
Predicting metabolic sites on lead candidates can circumvent metabolic liabilities later in the discovery pipeline and aid pro-drug design.
While in vitro techniques are increasingly high throughput, in silico identification of metabolic liabilities early in the drug discovery process keeps unsuitable drug candidates from being taken forward.

Identifying the problem
Developing a predictive model of regioselective metabolism by a CYP450 isozyme
Issues:
–For a given molecule, only the site of metabolism with the fastest reaction rate is known
–There is no information about relative rates of metabolism for other sites on the molecule
–Relative reaction rates between different molecules are unknown

Identifying the Problem: A racing metaphor
(Figure: molecule of lidocaine with regions labeled Race 1, Race 2, and Race 3)

Representation: Identifying distinct regions of a molecule (Metabolophores)
Topologically equivalent groups of hydrogens are identified, where abstraction of any member of the group results in the same metabolite.

Lidocaine
(Figure: lidocaine with its hydrogen groups labeled as numbered metabolophores; the green group designates the experimentally determined site of metabolism)
Base atom descriptors and atom descriptors: AM1 charge, hydrophobic moment, bond length, surface area, non-hydrogen bond count, hydrogen bond count, span, ring information, rotatable bonds
Physical environment: distribution of atom types at 1, 2, 3 and 4 bonds away from the base atom

Customized Model: From Chemistry to Machine Learning
First Try: Classification
Is this hydrogen group abstracted or not? Separate abstracted groups from all other groups.
(Figure: Molecules 1 to 4, each divided into numbered hydrogen groups)

Almost Multiple Instance Classification
Is a hydrogen in the group abstracted or not? Separate at least one hydrogen in each abstracted group from all other groups.
(Figure: Molecules 1 to 4, each divided into numbered hydrogen groups)

Learning Model: Multiple Instance Ranking (Bergeron et al., 2008)
Which group within the molecule will be preferred? MIRank finds a single ranking function across multiple molecules.
(Figure: Molecules 1 to 4, each divided into numbered hydrogen groups)
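A minimal sketch of how a single ranking function could be applied to one molecule at prediction time. It assumes a linear scoring function and assumes a group is scored by its best-scoring hydrogen; the weight vector, descriptor values, and the name rank_groups are hypothetical and are not taken from the MIRank code.

```python
import numpy as np

def rank_groups(molecule, w):
    """Rank the hydrogen groups of one molecule by descending score.

    molecule: dict mapping group name -> list of descriptor vectors (one per hydrogen).
    w: weight vector of the (assumed linear) ranking function.
    A group's score is the maximum score over its hydrogens (an assumption).
    """
    scores = {name: float(np.max(np.asarray(X, dtype=float) @ w))
              for name, X in molecule.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical molecule: three groups, two descriptors per hydrogen.
w = np.array([1.0, -0.5])
molecule = {
    "group1": [[0.2, 0.1], [0.3, 0.4]],
    "group2": [[0.9, 0.2]],
    "group3": [[0.1, 0.8], [0.0, 0.0]],
}
ranking = rank_groups(molecule, w)  # most preferred group first
```

Because the same w is used for every molecule, only the order of groups within a molecule matters; scores are never compared across molecules.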

Learning Model: Multiple Instance Ranking (Bergeron et al., 2008)
It is posed as a bilinear optimization problem. The source code, data and paper are available online: http://reccr.chem.rpi.edu/MIRank/
Model (equation annotations): empirical risk; regularization; tradeoff parameter; convex combination weights sum to one; convex combination weights are nonnegative; empirical risk terms are nonnegative; bilinear constraint.
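The annotations on this slide are consistent with a problem of the following shape. This is only a sketch under assumptions: the symbols, the squared 2-norm regularizer, and the margin of 1 are illustrative choices, and the exact formulation should be taken from the paper. Here P_j holds the hydrogens of the preferred group of molecule j, N_j holds the remaining hydrogens of molecule j, x denotes a descriptor vector, and \nu is the tradeoff parameter.

```latex
\begin{aligned}
\min_{w,\,\alpha,\,\xi}\quad
  & \sum_{j=1}^{m}\sum_{k \in N_j} \xi_{jk} \;+\; \nu\,\lVert w \rVert_2^2 \\
\text{s.t.}\quad
  & \sum_{i \in P_j} \alpha_{ji}\, x_{ji}^{\top} w \;-\; x_{jk}^{\top} w \;\ge\; 1 - \xi_{jk},
  && k \in N_j,\; j = 1,\dots,m \\
  & \sum_{i \in P_j} \alpha_{ji} = 1, \qquad \alpha_{ji} \ge 0, \qquad \xi_{jk} \ge 0 .
\end{aligned}
```

The first constraint is bilinear because the unknown weights \alpha_{ji} multiply the unknown vector w; with \alpha fixed the problem is convex in w and \xi, and with w fixed it is linear in \alpha and \xi.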

Results: Comparison with other methods
–Our descriptors and modeling techniques take advantage of the inherent molecule/metabolophore structure of the problem to effectively utilize limited experimental information.
–Results are statistically equivalent to previously published results.
–Predictions published by Sheridan utilize methods proprietary to Merck, while MetaSite is a commercial product.
–We are developing our method into a publicly available tool for online metabolic site predictions.

Results in a Blind Test for a Major Pharmaceutical Company
“Long story short, we're very impressed with the predictions for this preliminary test set of 20 compounds. If we had not had experimental data yet for these compounds, the predictions would have been very useful in directing our chemistry teams to the major or minor metabolic hot-spot for a large majority of the compounds.” (85% accuracy)

Conclusions and Directions:
Most accurate public domain method for hydrogen abstraction (online prototype).
The metabolite can be accurately determined using the predicted metabolophore.
Customize machine learning to the task.
Model enhancements: nonlinear kernel function, multi-task learning across isozymes, multi-level model.
Algorithm enhancements: new, faster class of nonsmooth nonconvex bundle methods for multiple instance learning.

Tuberculosis Tracking and Control Amina Shabeer, Cagri Ozgalar, S. Vandenberg, B. Yener, L. Cowan, J. Driscoll, K. Bennett and more at CDC, NYCDOH, NYDOH, PHRI, Institut Pasteur NIH R01 LM009731 cs.rpi.edu/~bennek/tbtrack

Tackling Tuberculosis
More than 8 million new cases and 2.5 million deaths a year worldwide
WHO: 1/3 of the world population is infected
Strong association with HIV epidemics and poverty
Emergence of multidrug-resistant strains
Extremely difficult to control
Goal: Use DNA fingerprinting of TB bacteria to track the spread of TB, detect new outbreaks, and guide control efforts

Genotyping helps TB Control
Two students/employees sick with TB.
TB Controller: Find source(s) of infection in order to identify people who need treatment and stop future transmission.
Genotype TB bacteria to see if patients are part of the same outbreak.

Identify the Problems
Extract information valuable to TB control efforts in NYC beyond “match or no match”
–Determine major phylogenetic lineages
–Visualize genotype and patient information to find “clusters of interest” (outbreaks): spoligoforests, patient/genotype clusters

Use M. tuberculosis complex DNA fingerprinting
–Insertion sequence 6110 restriction fragment length polymorphism (IS6110-RFLP)
–Polymorphic GC-rich sequence RFLP
–Spacer oligonucleotide typing (Spoligotyping)
–Mycobacterial interspersed repetitive units (MIRU)
–Single nucleotide polymorphism (SNP)
–Large sequence polymorphism (LSP)
Spoligotyping + MIRU: PCR based; routinely collected nationwide as part of TB surveillance data.
New York City also has IS6110-RFLP, which needs culture for several weeks.

NYC Data from 2001-2008
4984 patients, 137 countries
793 spoligotypes, 2648 RFLPs, 3235 distinct genotypes
594 “named” clusters
MIRU also available but incomplete

DNA Fingerprint: Spoligotyping
Direct repeats (DR) separated by variable spacers
Contiguous on chromosome, order well conserved
Forty-three spacers used
Presence of a spacer is detected: 1 = present, 0 = absent
(Figure: DR, spacer, and DVR layout; binary description of spoligotypes for strains M. tuberculosis, Beijing, and M. bovis)
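A small sketch of the 43-spacer binary representation described above. The helper names and the example pattern are hypothetical; the pattern is only an illustration and is not data from the NYC set.

```python
def parse_spoligotype(bits):
    """Convert a 43-character string of 0/1 into a list of ints (1 = spacer present)."""
    if len(bits) != 43 or set(bits) - {"0", "1"}:
        raise ValueError("expected 43 characters, each 0 or 1")
    return [int(b) for b in bits]

def absent_blocks(spacers):
    """Return (first, last) 1-based inclusive ranges of contiguous absent spacers."""
    blocks, start = [], None
    for pos, present in enumerate(spacers, start=1):
        if not present and start is None:
            start = pos                      # a run of absent spacers begins
        elif present and start is not None:
            blocks.append((start, pos - 1))  # the run ended at the previous spacer
            start = None
    if start is not None:
        blocks.append((start, len(spacers)))
    return blocks

# Hypothetical pattern: spacers 1 to 34 absent, spacers 35 to 43 present.
example = parse_spoligotype("0" * 34 + "1" * 9)
example_blocks = absent_blocks(example)
```

Working with runs of absent spacers rather than individual bits matches how deletions are described on the later slides, where one evolutionary event removes a contiguous block.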

Lineages in NYC
Major Genetic Lineages
Lineage               # of Spols   # of Patients
East Asian                    17             812
East-African Indian           60             272
Euro-American                526            3393
Indo-Oceanic                 103             391
M. africanum                  19              61
M. bovis                      18              55
Total                        743            4984
Major genetic lineages as determined by LSPs and SNPs are widely accepted. How do you determine these based only on spoligotypes/MIRU?

Evolution of Spoligotypes
Mtb is highly clonal. Evolution of spoligotypes is slow.
One or more contiguous spacers are lost in one evolutionary event.
Distinct phylogeographic groups.
Dollo Parsimony Assumption: Once lost, spacers are never regained.
(Figure: binary description of spoligotypes for East Asian, M. bovis, and Indo-Oceanic strains)
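The two assumptions above (one event removes a contiguous block; lost spacers are never regained) can be turned into a simple check of whether one spoligotype could descend from another in a single event. This is an illustrative sketch; the function name and the short 7-spacer strings are hypothetical.

```python
def lost_in_one_event(parent, child):
    """True if child can be obtained from parent by one contiguous deletion.

    parent, child: equal-length strings of 0/1 (1 = spacer present).
    """
    if len(parent) != len(child):
        raise ValueError("spoligotypes must have the same length")
    # Dollo assumption: a spacer absent in the parent cannot reappear in the child.
    if any(p == "0" and c == "1" for p, c in zip(parent, child)):
        return False
    lost = [i for i, (p, c) in enumerate(zip(parent, child)) if p == "1" and c == "0"]
    if not lost:
        return False  # identical patterns: no deletion event happened
    # Every spacer between the first and last lost position must be absent in the child.
    return "1" not in child[lost[0]:lost[-1] + 1]

one_block = lost_in_one_event("1111111", "1100111")   # spacers 3 and 4 lost together
two_blocks = lost_in_one_event("1111111", "1010111")  # two separate losses
regained = lost_in_one_event("1101111", "1111111")    # would require regaining a spacer
```

Spacers that were already absent in the parent may sit inside the deleted block, which is why the check looks at the child's pattern over the whole span rather than only at the lost positions.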

TB-LinRules: Determines Lineages
Precise rules use two types of features:
–Deletion of contiguous spacers
–MIRU Locus 24: if MIRU 24 > 1 then ancestral, otherwise modern.
Refines and clarifies rules developed from literature analysis by Dr. L. Cowan of US CDC.
Beta Version: http://www.cs.rpi.edu/~bennek/tbtrack/
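A sketch of the two feature types. Only the MIRU 24 threshold comes from the slide; the function names are hypothetical, and the actual spacer ranges used by TB-LinRules are not given in this transcript, so the deletion helper takes the range as an argument instead of hard-coding one.

```python
def ancestral_or_modern(miru24_repeats):
    """Apply the MIRU locus 24 rule: more than one repeat means ancestral."""
    return "ancestral" if miru24_repeats > 1 else "modern"

def spacers_deleted(bits, first, last):
    """True if spacers first..last (1-based, inclusive) are all absent in the 0/1 string."""
    return "1" not in bits[first - 1:last]

# Hypothetical inputs for illustration only.
label_two_repeats = ancestral_or_modern(2)
label_one_repeat = ancestral_or_modern(1)
block_absent = spacers_deleted("1100011", 3, 5)
```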

Genetic Diversity of TB in US
Each node = spoligotype; size = # of patients (log); colors = 6 genetic lineages
6 genetic lineages, 37K patients

99.9% Rules Match Lineage on US CDC Database (~37K CDC isolates)
Also tested on MIRU-VNTRPLUS.org datasets with >99% accuracy

NYC Isolates

Phylogeographic Distribution Ancestral Strains Modern Strains

M. bovis in NYC

East Asian in NYC

Indo-Oceanic

Euro American

Identification of Sub-families
Sub-families are needed for further subgroup identification.
No complete delineation of subfamilies exists, so unsupervised or semi-supervised learning is needed.
SPOTCLUST (Vitol et al., 2006): generative mixture model

SPOTCLUST
Hidden Parent Model captures one step of evolution.
Model infers hidden parents near the leaves of the tree without inferring phylogeny.
Allows additional loss of spacers with low probability.
One hidden parent (unknown) for each family.

Subfamily Probability Model
Visual rule generalizes to a probability model. Color represents the probability that a spacer is on.
Multivariate Bernoulli distribution: assumes each spacer is independent within a subfamily.
(Figure: M. africanum)
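Under the independence assumption, the probability of a spoligotype within one subfamily is a product of per-spacer Bernoulli terms. A minimal sketch follows; the function name and the two-spacer numbers are hypothetical, and the clipping constant eps is an implementation choice to avoid log(0).

```python
import math
import numpy as np

def bernoulli_log_likelihood(x, p, eps=1e-12):
    """Log-probability of binary vector x when spacer i is present with probability p[i]."""
    x = np.asarray(x, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    # Independent spacers: the log of the product is the sum of per-spacer logs.
    return float(np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p)))

# Hypothetical two-spacer example: present with prob 0.9, absent with prob 1 - 0.2.
log_lik = bernoulli_log_likelihood([1, 0], [0.9, 0.2])
expected = math.log(0.9 * 0.8)
```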

Population is a mixture of subfamilies
Bayesian network identifies 36 subfamilies using spoligotypes
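In a mixture model, assigning an isolate to a subfamily amounts to computing posterior probabilities from the mixing weights and the per-subfamily spacer probabilities. The sketch below uses a plain Bernoulli mixture without the hidden parent layer; the function name, the two subfamilies, and all numbers are hypothetical.

```python
import numpy as np

def subfamily_posterior(x, weights, probs, eps=1e-12):
    """Posterior probability of each subfamily given binary spoligotype x.

    weights: mixing proportions, one per subfamily (sum to one).
    probs: array (subfamilies x spacers) of spacer-present probabilities.
    """
    x = np.asarray(x, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    log_lik = (x * np.log(probs) + (1.0 - x) * np.log(1.0 - probs)).sum(axis=1)
    log_joint = np.log(np.asarray(weights, dtype=float)) + log_lik
    log_joint -= log_joint.max()          # stabilize before exponentiating
    post = np.exp(log_joint)
    return post / post.sum()

# Hypothetical mixture: two subfamilies over three spacers.
posterior = subfamily_posterior(
    x=[1, 1, 0],
    weights=[0.5, 0.5],
    probs=[[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]],
)
```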

Prototype of the M. tuberculosis Haarlem2 family: Bernoulli mixture model without Hidden Parent versus Bernoulli mixture model with Hidden Parent

Region of Birth

Time in US versus Age at Diagnosis

Age at Diagnosis of US-Born

The NYC M. bovis Mystery
Extrapulmonary M. bovis strikes
–Mexican immigrants
–US-born children of Mexican immigrants
Hypothesized cause: unpasteurized cheese

Conclusions
–Machine learning extracts critical information from public health databases.
–Customized machine learning methods enhance drug discovery and infectious disease tracking and control.
–Framing the right question is half the battle.
–Need innovative solutions on a wide range of learning tasks: (semi)(un)supervised, visualization.
–Goals are robust automated tools in the hands of front line users.
See http://reccr.chem.rpi.edu and http://www.cs.rpi.edu/~bennek/tbtrack

Procedure: Descriptor Generation
For each molecule, up to 25 conformations were generated using the standard Stochastic Search function offered within MOE.
Hydrogen-specific and base atom descriptors were then generated for each conformation using MOPAC and MOE.
The descriptor values for each conformation were then Boltzmann weighted, using the energy of the individual conformation relative to the energies over all computed conformations.
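A minimal sketch of Boltzmann weighting across conformations, where conformation i receives weight exp(-E_i / kT) divided by the sum of that term over all conformations. The energy units (kcal/mol), the temperature of 298.15 K, and the function name are assumptions; the slide does not state them.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann (gas) constant in kcal/(mol*K); units are an assumption

def boltzmann_weighted(descriptors, energies, temperature=298.15):
    """Average descriptor vectors over conformations using Boltzmann weights.

    descriptors: array (conformations x descriptors).
    energies: one energy per conformation, assumed to be in kcal/mol.
    """
    energies = np.asarray(energies, dtype=float)
    rel = energies - energies.min()            # shift so the lowest energy is zero
    weights = np.exp(-rel / (K_B * temperature))
    weights /= weights.sum()                   # normalize over all conformations
    return weights @ np.asarray(descriptors, dtype=float)

# Hypothetical values: two conformations, two descriptors each.
equal_energy = boltzmann_weighted([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
one_dominant = boltzmann_weighted([[1.0, 2.0], [3.0, 4.0]], [0.0, 100.0])
```

Shifting by the minimum energy leaves the weights unchanged but keeps the exponentials from underflowing when absolute energies are large.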

Population is a mixture of families
(Figure labels: 5%, 73%, Beijing, M. africanum)

Results: Comparison with other methods
–Our descriptors and modeling techniques take advantage of the inherent molecule / topological group / hydrogen structure of the problem
–MIRank yields results statistically equivalent to previously published results
–Sheridan’s method is proprietary to Merck, MetaSite is commercial, our method is public source
–Currently developing an online site for public molecular site prediction

Identifying the problem: Potential Mechanisms

Customized machine learning models of all different flavors
Supervised
–classification, regression, ranking
–multi-instance, multi-task
–SVM, Bayes Nets, Expert Systems
Unsupervised
–Clustering, Anomaly Detection, Tensors, Visual Analytics, Networks
Semi-supervised
–Data Fusion
Descriptor representation and selection
–Unsupervised, Semi-supervised

Procedure: MIRank Implementation
After the tradeoff parameter is selected, the validation set is folded into the training set to determine the final model.
A molecule is considered correctly predicted if the experimental site of metabolism is ranked first or second.
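The success criterion above is a top-2 accuracy over molecules. A short sketch, with a hypothetical function name and made-up group labels:

```python
def top_k_accuracy(predicted_rankings, true_sites, k=2):
    """Fraction of molecules whose experimental site is among the top k ranked groups."""
    if len(predicted_rankings) != len(true_sites) or not true_sites:
        raise ValueError("need one ranking per molecule and at least one molecule")
    hits = sum(site in ranking[:k]
               for ranking, site in zip(predicted_rankings, true_sites))
    return hits / len(true_sites)

# Hypothetical rankings for three molecules (most preferred group first).
rankings = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
experimental = ["b", "a", "c"]
accuracy = top_k_accuracy(rankings, experimental, k=2)
```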

Motivation: What’s come before
Reactivity based (ligand only):
–QSAR-based regioselectivity models using the random forest algorithm (Sheridan et al., 2007)
–AM1 semi-empirical calculations (Singh et al., 2003) used to estimate the energy necessary to abstract a hydrogen atom from a substrate
Recognition based (ligand and enzymatic structure):
–MetaSite, a reactivity and recognition based application (Cruciani et al., 2005) utilizing GRID molecular interaction fields (Goodford et al., 1985)
–Docking algorithms: Dock (Ewing et al., 2001), Glide (Friesner et al., 2004), and GLUE (Zamora et al., 2006)

Euro American
