C2D Cheminformatics: Methods, Tools and Results. By OSDD-Cheminformatics team.


The burden of TB
– About 9 million people were infected with TB in 2009, and 1.7 million died.
– India is the world's TB capital, with an estimated 1.9 million cases reported every year.
– India has the 2nd largest estimated number of MDR-TB cases (99,000 in 2008).
– By July 2010, 58 countries had reported at least one case of XDR-TB.

Cheminformatics: What?
Computers have been applied to solve problems almost everywhere. When we use them in chemistry, we call it cheminformatics. Cheminformatics is mostly applied to large numbers of molecules. It deals with:
– Storage, retrieval and cross-linking of chemical structures and associated data.
– Prediction of physical, chemical and biological properties of compounds.
– Analysis and prediction of reactions.
– Drug design...

Steps in drug development
Disease selection → Target hypothesis → Lead compound identification (screening) → Lead optimization → Pre-clinical trial → Clinical trial → Pharmacogenomic optimization.

Cheminformatics in drug design
Building computational models for the drug discovery process.
Diagram labels: Target, Virtual Screening, Data, Data Mining, Hit Identification, Lead Identification, Lead Optimization.

Aim of the Cheminformatics Project
– To screen molecules interacting with the potential TB targets using classifiers.
– Take the selected molecules and dock them with the targets to further screen the molecules for leads.
– Use cheminformatics techniques such as QSAR, 3D-QSAR and ADMET to look for potential leads, and design drugs from the leads by building combinatorial libraries.

Ways to perform virtual screening
– Use a previously derived mathematical model that predicts the biological activity of each structure.
– Run substructure queries to eliminate molecules with undesirable functionality.
– Use a docking program to identify structures predicted to bind strongly to the active site of a protein (if the target structure is known).
– Filters remove unwanted structures in a succession of screening methods.
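
The idea of applying filters in succession can be sketched as follows. This is a minimal illustration, not the team's pipeline: the helper `apply_filters`, the toy compounds and the pre-computed flags `has_unwanted_group` and `predicted_active` are all hypothetical stand-ins for the results of a real substructure query and a real activity model.

```python
def apply_filters(compounds, filters):
    """Run filters in order; return the survivors and a per-stage count."""
    survivors = list(compounds)
    stage_counts = []
    for name, keep in filters:
        survivors = [c for c in survivors if keep(c)]
        stage_counts.append((name, len(survivors)))
    return survivors, stage_counts


# Hypothetical pre-computed flags for four toy compounds.
compounds = [
    {"id": "cmpd_1", "has_unwanted_group": False, "predicted_active": True},
    {"id": "cmpd_2", "has_unwanted_group": True, "predicted_active": True},
    {"id": "cmpd_3", "has_unwanted_group": False, "predicted_active": False},
    {"id": "cmpd_4", "has_unwanted_group": False, "predicted_active": True},
]
filters = [
    ("substructure query", lambda c: not c["has_unwanted_group"]),
    ("activity model", lambda c: c["predicted_active"]),
]
survivors, stage_counts = apply_filters(compounds, filters)
```

Recording the count after each stage makes it easy to see which filter removes most of the library, which helps when deciding the order of cheap and expensive steps.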

Main classes of virtual screening methods
The choice depends on the amount of structural and bioactivity data available:
– One active molecule known: perform a similarity search (ligand-based virtual screening).
– Several active molecules known: try to identify a common 3D pharmacophore, then do a 3D database search.
– Reasonable number of active and inactive structures known: train a machine learning model (with the help of molecular descriptors or molecular properties).
– 3D structure of the protein known: use protein-ligand docking.
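
The first case in the list (one known active, similarity search) can be illustrated with a small sketch. It assumes each molecule is already encoded as a set of "on" fingerprint bit positions and uses the Tanimoto coefficient as the similarity measure. The slides do not name a similarity measure, so the coefficient, the function names `tanimoto` and `similarity_search`, the 0.5 threshold and the toy fingerprints are all assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query_fp, library, threshold=0.5):
    """Return (name, score) pairs scoring at or above threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical fingerprints, not real compounds.
known_active = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 9},   # identical to the query
    "cmpd_B": {1, 4, 7, 12},  # 3 shared bits out of 5 distinct bits
    "cmpd_C": {2, 3, 5},      # nothing in common with the query
}
hits = similarity_search(known_active, library, threshold=0.5)
```

With these toy inputs, `cmpd_A` and `cmpd_B` are returned and `cmpd_C` is dropped; in practice the threshold is a tuning choice that trades recall against the number of compounds passed on.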

Molecule properties
SPC: Structure Property Correlation
– Intrinsic properties: molar volume, connectivity indices, charge distribution, molecular weight, polar surface area.
– Chemical properties: pKa, log P, solubility, stability.
– Biological properties: activity, toxicity, biotransformation, pharmacokinetics.

Molecular descriptors used for machine learning
Molecular descriptors are numerical values that characterize properties of molecules. The descriptors fall into four classes:
a) Topological
b) Geometrical
c) Electronic
d) Hybrid or 3D descriptors

Descriptors used for classification
Name of descriptors used      Number of descriptors
Pharmacophore Fingerprints    147
Weighted Burden Number        24
Properties                    8

Data mining
According to David Hand et al. (MIT Press, 2001), "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner".
Data mining... but why?
Data → Information → Knowledge
– The main aim of a user is always to extract knowledge from the information obtained from data.
– Data mining is one of the key steps in the knowledge discovery process, although it is sometimes confused with knowledge discovery itself!
– A user always wants to find more information while spending the least amount of time exploring the resources.

Data mining in cheminformatics
Data mining approaches are an integral part of cheminformatics and pharmaceutical research. Their role will tend to grow as computational methods for biology and chemistry become more widespread. Data mining has found major use in the virtual screening process of cheminformatics.

Data Mining Taxonomy

Classifier algorithms used
– Bayes: Naïve Bayes
– Trees: J48, Random Forest
– Functions: SMO

WORKFLOW

– Accessing the HTS bioassay data
– Upload the SDF file
– All compounds SDF file
– Generate descriptor file
– Open the CSV file in Excel
– Bioassay result (all)
– File splitting: training / testing
– Remove the useless attributes
– Select the active and inactive compounds
– Apply classifier algorithms
– Selection of best classifier model (TP %, FP 70%)
– Append the bioassay result corresponding to the compounds
Tools: PubChem, PowerMV, Excel, WEKA
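
The "file splitting" step of the workflow can be sketched as a stratified split, which keeps the active/inactive proportions the same in the training and testing files. The slides do not say how the split was done, so the 80/20 ratio, the stratification itself, the function name `stratified_split` and the toy compound IDs are assumptions for illustration only.

```python
import random


def stratified_split(records, test_fraction=0.2, seed=0):
    """Split (compound_id, label) records into train/test sets, label by label."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = {}
    for record in records:
        by_label.setdefault(record[1], []).append(record)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical labelled compounds: 10 actives and 40 inactives.
records = [(f"CID{i}", "active") for i in range(10)]
records += [(f"CID{i}", "inactive") for i in range(10, 50)]
train, test = stratified_split(records, test_fraction=0.2)
```

Splitting per label matters for HTS data because actives are usually rare; a purely random split can leave the test file with very few actives, which makes the TP rate unreliable.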

Molecular descriptor generation
– Chemistry Development Kit (CDK)
– PowerMV

PowerMV
A software environment for molecular viewing, descriptor generation, data analysis and hit evaluation. An operating environment for biologists and statisticians for viewing or browsing medium to large molecular SD files and computing descriptors.

Features
– Importing, viewing and sorting SD files.
– Capacity is limited only by available memory.
– Compound structures and attributes can be easily exported to Microsoft Excel.

Pre-requisites: requires the .NET Framework.
Limitation: Windows-based.

Weka toolkit
Collection of machine learning algorithms for data analysis and classification experiments. Tools are available for data pre-processing, classification, regression, clustering, association rules, and visualization.

Weka on GARUDA

The script file
RemoveUselessAttributes:
java -Xmx4000m weka.filters.unsupervised.attribute.RemoveUseless -i -o
Using cost-sensitive classification:
java -Xmx4000m weka.classifiers.meta.CostSensitiveClassifier -cost-matrix "[ ; ]" -t AID1626train.arff -x 5 -d smo.model -W weka.classifiers.functions.SMO -i -- -M
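
To make the role of a cost matrix concrete, here is a minimal sketch of one general way such a matrix can change a prediction: choose the class with the lowest expected misclassification cost. The cost values on the slide are elided, so the numbers below are hypothetical, and the function `min_expected_cost_class` is an illustrative helper. This is not a description of what the Weka command above does internally.

```python
def min_expected_cost_class(probabilities, cost_matrix):
    """Pick the predicted class with the lowest expected misclassification cost.

    probabilities[i]  : estimated probability that the true class is i
    cost_matrix[i][j] : cost of predicting class j when the true class is i
    """
    n_classes = len(probabilities)
    expected = [
        sum(probabilities[i] * cost_matrix[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    best = min(range(n_classes), key=lambda j: expected[j])
    return best, expected


ACTIVE, INACTIVE = 0, 1
# Hypothetical costs: missing an active (FN) is 10x worse than a false alarm (FP).
costs = [[0.0, 10.0],
         [1.0, 0.0]]
# A compound the base classifier rates as only 20% likely to be active.
predicted, expected = min_expected_cost_class([0.2, 0.8], costs)

# With equal costs the same compound would be called inactive.
equal_costs = [[0.0, 1.0],
               [1.0, 0.0]]
predicted_equal, _ = min_expected_cost_class([0.2, 0.8], equal_costs)
```

With the hypothetical 10:1 costs the compound is labelled active even at 20% probability, which is how penalising false negatives pulls more compounds into the active class.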

Case study: AID899
To get trained in using different classifiers in Weka and analyzing the results.

CYP450: a novel target against Mycobacterium tuberculosis

Why CYP450
– The P450s are mono-oxygenase enzymes that generally interact with flavoprotein and/or iron–sulphur centre redox partners for catalysis.
– The Mtb genome sequence: a plethora of P450s. It is "P450 dense" by comparison with eukaryotic genomes.
– The most effective azoles have extremely tight binding constants for one of the Mtb P450s (CYP121).
– Analysis of Mtb CYP51 revealed that P420 is an irreversibly inactivated and structurally disrupted species.
– Mutations were largely located not in the active site area itself, but in regions that are conformationally mobile, where entry and exit of substrate to the active site is facilitated.
– Thus, acquired resistance could be mediated by mutations that enhance flexibility and conformational rearrangements, leading to increased activity.
Organism          P450s   Genome size      Ratio
Humans                    billion bp       1:5.8 million bp
D. melanogaster           million bp       1:1.5 million bp
A. thaliana               million bp       1:462,000 bp
M. tuberculosis   20      4.4 million bp   1:220,000 bp

Objectives
To develop a model from the AID 899 HTS data to study compound/drug interaction with human CYP450.
Why?
1) A lead molecule should not interact with human CYP450:
   a) drug metabolism
   b) affecting CYP450
2) It should work against the CYP450 of M. tuberculosis.

Work plan
Select active/inactive compounds against human CYP450 from PubChem HTS data → Generate model for lead compound screening → Screen the compounds via the model → Select the inactives → Test against mycobacterial CYP450 (model) → Select active lead compound → In silico drug design → In vitro and in vivo studies.
Legend: current work; to be worked on.

Confusion matrix
– TP: active classified as active
– FN: active classified as inactive
– FP: inactive classified as active
– TN: inactive classified as inactive
Base classifier and cost-sensitive classifier (CSC)
CSC → setting a cost factor for false negatives → TP and FP rates both increase.
So FN is more important than FP.
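
The rates discussed on this slide follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions (TP rate = TP / (TP + FN), FP rate = FP / (FP + TN)); the counts are hypothetical and chosen only to show the typical effect of penalising false negatives, they are not results from AID 899.

```python
def rates(tp, fn, fp, tn):
    """Derive common rates from confusion-matrix counts."""
    return {
        "tp_rate": tp / (tp + fn),   # fraction of actives correctly recovered
        "fp_rate": fp / (fp + tn),   # fraction of inactives wrongly flagged
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts for 100 actives and 900 inactives.
base = rates(tp=60, fn=40, fp=50, tn=850)    # base classifier
csc = rates(tp=85, fn=15, fp=200, tn=700)    # cost-sensitive classifier
```

In this toy case the cost-sensitive run recovers more actives (TP rate 0.60 to 0.85) at the price of more false alarms and lower overall accuracy, which is why accuracy alone is a poor guide when missing an active is the costlier error.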

Problems faced
– Data redundancy
– Computational power
– Communication: need an alternative to Skype
– Institutional limitations: ban on media streaming, social networks, chat, etc.

Data redundancy
We tried two approaches for processing the AID to obtain the train and test data sets.
Method 1: We downloaded the SDF file containing all tested compounds, and the bioassay data files for the same compounds. We then matched them in MS Excel. The data contained active, inactive, inconclusive and discrepant compounds. We selected only the active and inactive compounds and ran them through PowerMV to get a CSV file. After converting it to ARFF, we produced the test and train sets from it. We loaded the two files in Weka and used different algorithms to build the best model.
Method 2: We downloaded the active and inactive SDF files separately from the same PubChem page. After processing in PowerMV, both files were combined into one. The remaining steps were the same as in Method 1.
Problem: The final numbers of active and inactive compounds differ between the methods.
            Active   Inactive   Discrepancy   Inconclusive
Method I
Method II                       Nil           1279
The AID is not curated. The problem was reported to PubChem; the director will be looking at it.
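
A mismatch like the one described here can be localised by comparing the compound labels produced by the two methods. The sketch below assumes each method's output has been reduced to a mapping from compound ID to outcome; the function `compare_label_sets` and the toy IDs are hypothetical, and no real AID 899 data is used.

```python
def compare_label_sets(method_1, method_2):
    """Compare two {compound_id: outcome} mappings built from the same assay."""
    ids_1, ids_2 = set(method_1), set(method_2)
    shared = ids_1 & ids_2
    return {
        "only_in_method_1": sorted(ids_1 - ids_2),
        "only_in_method_2": sorted(ids_2 - ids_1),
        "label_conflicts": sorted(
            cid for cid in shared if method_1[cid] != method_2[cid]
        ),
    }


# Toy example with hypothetical compound IDs.
method_1 = {"CID1": "active", "CID2": "inactive", "CID3": "inactive"}
method_2 = {"CID1": "active", "CID2": "active", "CID4": "inactive"}
report = compare_label_sets(method_1, method_2)
```

Listing the compounds that appear in only one set, or with conflicting labels, gives a concrete list to attach to a curation report instead of only the differing totals.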

Progress & results
1) We understood the basics of working with Weka.
2) We learned how to derive results from the confusion matrix.
3) A previously ignored classifier (LAZY) gives good results.
4) We got good results with Random Forest and others, unlike what is reported in the Virtual bioassay paper.
5) Maximum accuracy of 86.16.

Strategy followed
– From the preliminary investigation it is clear that AID 899 is not a properly curated dataset.
– In Method I, many classifiers were applied and the results are presented below.
– In Method II, many classifiers can still be run and results generated.

List of best classifiers: FP 75

Sincere thanks to OSDD.