Chemoinformatics: Where It Has Come From, Where It Is Now And Where It Is Going Peter Willett University of Sheffield, UK.


1 Chemoinformatics: Where It Has Come From, Where It Is Now And Where It Is Going Peter Willett University of Sheffield, UK

2 Overview Chemoinformatics: what is it? Historical development Current status Current research and teaching in chemoinformatics

3 Information Systems An important part of computer science –Database management systems –Information retrieval systems –Multimedia information systems –Knowledge-based systems Domain-specific information systems –Geographic information systems (cartographic information) –Biological information systems (biological sequences) –Chemical information systems (chemical structures)

4 Chemical Information Systems The first information systems and services were paper-based –Annalen der Pharmacie founded by Justus Liebig in 1832 –Chemical Abstracts started in 1907 Computer-based chemical information systems, both public and private, have been under development for over 40 years Emergence of chemoinformatics –chemical informatics, chemical information management/science, chemiinformatics, cheminformatics

5 Chemoinformatics: Some Definitions “The use of information technology and management has become a critical part of the drug discovery process. Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and organization” FK Brown, Annual Reports in Medicinal Chemistry, 33, 375-384 (1998) “Chem(o)informatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information” G Paris (August 1999 ACS meeting), quoted by WA Warr at http://www.warr.com/warrzone.htm Chemoinformatics - a new name for an old problem? M Hann and R Green, Current Opinion in Chemical Biology, 3, 379-383 (1999)

6 Milestones In The Development Of Chemical Information First attempts at structure representation and searching Efficient implementation of substructure searching Similarity searching 3D substructure search See http://www.libsci.sc.edu/bob/chemnet/chchron.htm for a more detailed history

7 Milestones In Chemical Information: I Work at NBS first demonstrated atom-by-atom searching (1957) –Identification of a user-defined substructural pattern in a database structure –Use of a graph-searching algorithm –Principal representation for most subsequent systems for the storage and retrieval of 2D chemical structures (later, 3D also) Start (1960) of NSF funding to Chemical Abstracts Service to develop methods for storing and searching both structural and textual information, e.g., production of Chemical Titles (1961) and Morgan algorithm for generating unique chemical graphs (1965) Institute for Scientific Information produces first issue of Index Chemicus, based on Wiswesser Line Notation (1960)
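
A minimal sketch of atom-by-atom searching as a backtracking graph search, assuming a molecule is reduced to element labels plus an adjacency map (bond orders and hydrogens are ignored). The function name `find_substructure` and the data layout are illustrative assumptions, not the NBS implementation.

```python
from typing import Dict, List, Optional, Set

# Assumed representation: atom index -> element symbol, and
# atom index -> set of bonded atom indices.
Graph = Dict[int, Set[int]]


def find_substructure(
    q_labels: Dict[int, str], q_adj: Graph,
    m_labels: Dict[int, str], m_adj: Graph,
) -> Optional[Dict[int, int]]:
    """Return one mapping query atom -> molecule atom, or None if no match."""
    order: List[int] = sorted(q_labels)

    def extend(mapping: Dict[int, int]) -> Optional[Dict[int, int]]:
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]          # next query atom to place
        used = set(mapping.values())
        for m in m_labels:
            if m in used or m_labels[m] != q_labels[q]:
                continue
            # Every already-mapped neighbour of q must be bonded to m.
            if all(mapping[n] in m_adj[m] for n in q_adj[q] if n in mapping):
                mapping[q] = m
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]           # backtrack
        return None

    return extend({})


# Ethanol skeleton C-C-O as the database structure, C-O as the query.
ethanol_labels = {0: "C", 1: "C", 2: "O"}
ethanol_adj = {0: {1}, 1: {0, 2}, 2: {1}}
query_labels = {0: "C", 1: "O"}
query_adj = {0: {1}, 1: {0}}
match = find_substructure(query_labels, query_adj, ethanol_labels, ethanol_adj)
```

The first carbon tried has no bonded oxygen, so the search backtracks and succeeds on the second carbon; this trial-and-error behaviour is what makes unscreened atom-by-atom searching expensive on large files.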

8 Milestones In Chemical Information: II Sussenguth algorithm for substructure searching (Harvard, 1965) –Introduced idea of set reduction strategies that can drastically prune the search trees involved in atom-by-atom searching –Insufficient, on its own, to enable rapid substructure searching Need for screening (indexing) strategies that can eliminate much of a database without inspection –Use of substructural fragments for screening (Sheffield and NIH, 1965-75) based on statistical selection criteria First batch (1975) and then online (1980) systems for substructure searching of CAS Registry System (which holds all structures reported in Chemical Abstracts)
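
The screening idea can be sketched as a bit-string test: a database structure can only contain the query if every fragment bit set for the query is also set for the structure. The fragment dictionary, molecule names and function names below are hypothetical; real systems choose their fragments by statistical criteria, as the slide notes.

```python
from typing import Dict, Iterable, List

# Hypothetical fragment dictionary: fragment -> bit position.
FRAGMENT_BITS: Dict[str, int] = {"C-C": 0, "C-O": 1, "C=O": 2, "C-N": 3}


def build_screen(fragments: Iterable[str]) -> int:
    """Encode a set of fragments as an integer bit-string."""
    bits = 0
    for frag in fragments:
        if frag in FRAGMENT_BITS:
            bits |= 1 << FRAGMENT_BITS[frag]
    return bits


def passes_screen(query_bits: int, mol_bits: int) -> bool:
    """True if all query bits are present in the molecule's bit-string."""
    return (query_bits & mol_bits) == query_bits


def screen_database(query_bits: int, database: Dict[str, int]) -> List[str]:
    """Return the names of structures that survive the screen."""
    return [name for name, bits in database.items()
            if passes_screen(query_bits, bits)]


database = {
    "ethanol": build_screen(["C-C", "C-O"]),
    "acetic acid": build_screen(["C-C", "C-O", "C=O"]),
    "methylamine": build_screen(["C-N"]),
}
candidates = screen_database(build_screen(["C-O", "C=O"]), database)
```

Structures that pass the screen are only candidates; they still need the atom-by-atom check, because sharing fragments does not guarantee that the fragments are connected in the required way.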

9 2D Substructure Searching

10 Milestones In Chemical Information: III Substructure search is appropriate for highly constrained database querying –Need for best match, browsing capabilities (consideration of “similar property principle” and “neighbourhood behaviour”) Need to define similarity, an inherently subjective concept, in quantitative terms for search purposes –Graph-based approach to identify maximal common substructures –Greater efficiency of fragment-based measures of similarity led to their rapid adoption following initial work at Lederle and Pfizer (1985-86)

11 2D Similarity Searching

12 Milestones In Chemical Information: IV Structure diagrams are planar but molecules are not, so need to extend existing 2D screening and graph-search methods to allow 3D substructure searching (Pfizer and Lederle, 1986-87) Sources of 3D structural data –Experimental data (Cambridge Structural Database) –Computational chemistry (quantum mechanics, molecular mechanics, molecular dynamics) –Structure-generation methods for databases of molecules CONCORD (Texas, 1987) CORINA (Munich/Erlangen, 1990) Further extensions to allow flexible searching (ICI, MDLI and Tripos, 1991-94)

13 3D Substructure Searching

14 From Chemical Information To Chemoinformatics Need to go beyond simple archival functions (storage and retrieval) by exploiting the information so as to assist more directly in the discovery of novel bioactive molecules Drivers –Integration with techniques from molecular modeling (both scientifically and organisationally in many cases) –Developments in computer hardware and software –Data explosion arising from developments in combinatorial chemistry and high-throughput screening Other types of -informatics becoming common

15 Milestones In The Development Of Molecular Modelling Quantitative structure-activity relationships (QSAR) Pharmacophore mapping Docking 3D QSAR

16 Milestones In Molecular Modelling: I Statistical correlation of biological activity with molecular features Hansch analysis (Pomona College, 1964) –Use of physicochemical properties characterising steric, electrostatic and hydrophobic properties of molecules Free-Wilson analysis (Smith Kline & French, 1964) –Use of variables denoting the presence or absence of substructural features in related molecules –Development for more diverse molecules (Smith Kline & French, 1973)

17 Milestones In Molecular Modelling: II Identification of pharmacophoric patterns by analysis of active molecules to find the structural features in common Need for effective programs for conformational analysis –Active analogue approach of Marshall (Washington, 1979) to find appropriate separations of known sets of pharmacophore points –DISCO program to find possible sets of points (Abbott, 1993)

18 Milestones In Molecular Modelling: III Positioning of a putative ligand into a protein’s active site, first attempted by the DOCK program (UCSF, 1982) Initially restricted to rigid ligands and rigid proteins: current programs permit some degree of flexibility Use in structure-based design –Move from docking a single ligand to sequential docking of large datasets

19 Milestones In Molecular Modelling: IV Use of 3D information in QSAR to facilitate structure-based approaches to drug discovery COmparative Molecular Field Analysis (Tripos 1988), and related approaches –Calculate energies at points on a 3D grid surrounding a molecule –Statistical correlation with activity to identify important positions in space –Need for alignment

20 The Emergence Of A Discipline The first, and still the core, journal for the subject, the Journal of Chemical Documentation, started in 1961 (the name changed to the Journal of Chemical Information and Computer Science in 1975) The first book on the subject appeared in 1971 (Lynch, Harrison, Town and Ash, Computer Handling of Chemical Structure Information) The first international conference on the subject was held in 1973 at Noordwijkerhout, and every three years since 1987 Many contributors to this emergence but the first full university course not till 2000

21 Current Activities The ability to synthesise and to assay vastly greater numbers of molecules than even a very few years ago requires tools to rationalise the resulting structural and biological data Scale-up of previous work (new algorithms and faster hardware) Development of new methods (random/rational argument) –Molecular diversity analysis –Virtual screening

22 Current Activities: Molecular Diversity Analysis Structurally similar molecules will tend to have the same properties Considerations of cost-effectiveness suggest the need to focus, initially at least, on structurally diverse sets of molecules (so as to avoid generating redundant data) Substantial efforts (e.g., Chiron, Texas, Tripos) to devise tools for –Identifying appropriate structural representations –Selecting sets of molecules –Quantifying diversity
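
One way to illustrate "selecting sets of molecules" is dissimilarity-based selection. The sketch below uses the MaxMin heuristic with Tanimoto dissimilarity on fragment bit-strings; this is a commonly used technique chosen here for illustration, not necessarily the method used by the groups named on the slide, and all names and fingerprints are made up.

```python
from typing import Dict, List


def tanimoto(x: int, y: int) -> float:
    """Tanimoto similarity of two bit-strings held as integers."""
    common = bin(x & y).count("1")
    union = bin(x | y).count("1")
    return common / union if union else 0.0


def maxmin_select(fps: Dict[str, int], k: int, seed: str) -> List[str]:
    """Greedily pick k molecules, each maximising its minimum
    dissimilarity (1 - Tanimoto) to the molecules already chosen."""
    selected = [seed]
    while len(selected) < min(k, len(fps)):
        candidates = [name for name in fps if name not in selected]
        best = max(
            candidates,
            key=lambda n: min(1.0 - tanimoto(fps[n], fps[s]) for s in selected),
        )
        selected.append(best)
    return selected


# Toy fingerprints (assumed data).
fingerprints = {"m1": 0b1111, "m2": 0b1110, "m3": 0b0001, "m4": 0b0110}
subset = maxmin_select(fingerprints, k=3, seed="m1")
```

Here `m2` is left out because it is nearly identical to the seed `m1`, which is the redundancy the slide argues against.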

23 Current Activities: Virtual Screening Need to prioritise the many molecules that could be tested Increasingly sophisticated level of filtering to maximise the numbers of potential leads –“Drugability” considerations –Similarity searching (both 2D and 3D) using initial weak leads –3D substructure searching once possible pharmacophoric patterns have been identified –Docking once the 3D structure of the biological target is available

24 Challenges Extensions of current activities –Extension of filtering, e.g., ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) –Improved scoring functions in virtual screening, e.g., enhanced flexibility in ligand docking Data mining tools to complement existing approaches (graph theory, cluster analysis, genetic algorithms, etc.) –Decision trees, neural networks, visualisation Integration of chemoinformatics with other types of information processing –Laboratory information management systems –Datasets from -omics research Other application domains in the molecular sciences –materials science, food science (nutraceuticals), atmospheric chemistry, polymer chemistry

25 Graph Theory Graph theory is a branch of mathematics that considers sets of objects, called nodes, and the relationships, called edges, between pairs of these objects The definition is completely general, allowing graphs to be used in many different application domains as long as an appropriate representation can be derived Isomorphism procedures enable the comparison of pairs of graphs –Maximal common subgraph isomorphism enables the identification of the largest subgraph common to a pair of graphs –An extremely time-consuming computational problem, but reasonably efficient algorithms are available for small, labelled graphs
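
A minimal sketch of maximal common subgraph detection for small labelled graphs, assuming the clique-based formulation: build a correspondence graph of label-compatible node pairs and find its largest clique with a basic Bron-Kerbosch search. It finds the largest common induced subgraph, which need not be connected, and it is exponential in the worst case. Function names and the graph layout are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]
Adj = Dict[int, Set[int]]


def correspondence_graph(l1: Dict[int, str], a1: Adj,
                         l2: Dict[int, str], a2: Adj) -> Dict[Pair, Set[Pair]]:
    """Nodes are label-compatible pairs (u, v); two pairs are joined when
    they agree on whether their members are adjacent in each graph."""
    nodes = [(u, v) for u in l1 for v in l2 if l1[u] == l2[v]]
    graph: Dict[Pair, Set[Pair]] = {n: set() for n in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue
        if (u2 in a1[u1]) == (v2 in a2[v1]):
            graph[(u1, v1)].add((u2, v2))
            graph[(u2, v2)].add((u1, v1))
    return graph


def max_clique(graph: Dict[Pair, Set[Pair]]) -> Set[Pair]:
    """Largest clique via Bron-Kerbosch without pivoting."""
    best: Set[Pair] = set()

    def expand(r: Set[Pair], p: Set[Pair], x: Set[Pair]) -> None:
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        for n in list(p):
            expand(r | {n}, p & graph[n], x & graph[n])
            p = p - {n}
            x = x | {n}

    expand(set(), set(graph), set())
    return best


def maximal_common_subgraph(l1: Dict[int, str], a1: Adj,
                            l2: Dict[int, str], a2: Adj) -> Dict[int, int]:
    """Return a node mapping graph1 -> graph2 for the largest common subgraph."""
    return dict(max_clique(correspondence_graph(l1, a1, l2, a2)))


# Ethanol skeleton (C-C-O) against dimethyl ether skeleton (C-O-C).
eth_l, eth_a = {0: "C", 1: "C", 2: "O"}, {0: {1}, 1: {0, 2}, 2: {1}}
dme_l, dme_a = {0: "C", 1: "O", 2: "C"}, {0: {1}, 1: {0, 2}, 2: {1}}
mcs = maximal_common_subgraph(eth_l, eth_a, dme_l, dme_a)
```

For these two skeletons the result is a two-atom C-O unit: the carbons cannot both be matched because they are bonded in one graph and not in the other.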

26 Examples Of Graphs

27 Searching 3D Protein Structures Searching protein sequences is well established: how to search the 3D structures in the Protein Data Bank (PDB)? Extensive collaboration between Information Studies and Molecular Biology and Biotechnology to develop graph representations of proteins that can be searched with isomorphism algorithms analogous to those used for chemical structures Focus here on folding motifs (secondary structure elements) in proteins, but the approach also applies to other structures –Protein amino acid sidechains –Carbohydrates –Nucleic acids

28 Representation Of Protein Folding Motifs: I The helix and strand secondary structure elements (SSE) are both approximately linear, repeating structures, which can hence be represented by vectors drawn along their major axes The nodes of the graph are these vectors and the edges comprise: –The angle between a pair of vectors –The distance of closest approach of the two vectors –The distance between the vectors’ mid-points PROTEP compares such representations using a maximal common subgraph isomorphism algorithm to identify common folds
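
Two of the three edge attributes can be sketched directly from the axis vectors. The code below assumes each SSE is given as a (start, end) pair of 3D points; the names are illustrative, and the distance of closest approach is left out to keep the sketch short. This is not PROTEP's own code.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]
Segment = Tuple[Point, Point]  # start and end of an SSE axis vector


def _sub(a: Point, b: Point) -> Point:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Point, b: Point) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def inter_axis_angle(s1: Segment, s2: Segment) -> float:
    """Angle in degrees between the direction vectors of two SSEs."""
    d1, d2 = _sub(s1[1], s1[0]), _sub(s2[1], s2[0])
    cos = _dot(d1, d2) / math.sqrt(_dot(d1, d1) * _dot(d2, d2))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding error
    return math.degrees(math.acos(cos))


def midpoint_distance(s1: Segment, s2: Segment) -> float:
    """Distance between the mid-points of two SSE axis vectors."""
    m1 = tuple((a + b) / 2.0 for a, b in zip(s1[0], s1[1]))
    m2 = tuple((a + b) / 2.0 for a, b in zip(s2[0], s2[1]))
    diff = _sub(m1, m2)
    return math.sqrt(_dot(diff, diff))


# Two made-up SSE axes: one along x, one along y and lifted 3 units in z.
helix = ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
strand = ((0.0, 0.0, 3.0), (0.0, 2.0, 3.0))
angle = inter_axis_angle(helix, strand)
separation = midpoint_distance(helix, strand)
```

In a search these values would be compared within tolerances rather than for exact equality, since equivalent folds in different proteins never superimpose perfectly.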

29 Representation Of Protein Folding Motifs: II

30 Structural Relationship Between Leucine Aminopeptidase And Carboxypeptidase A Use of 1LAP as the target for a PROTEP search requiring structures with at least 7 SSEs in common with the target The four carboxypeptidase structures in the PDB at that time have a fold containing five helices and eight strands in a sheet in common with 1LAP The matched SSEs (in 5CPA) contain 86 residues with alpha-carbon RMSD of 1.77 Angstroms, but only 7% sequence homology for the equivalenced residues

31 Structural Relationship Between Leucine Aminopeptidase And Carboxypeptidase A (figure not reproduced; panels labelled CPA and LAP)

32 Fusion Of Similarity Coefficients Many ways of computing the similarity between two molecules –Different representations –Different similarity coefficients –Different weighting schemes The standard approach is to use 2D fragment bit-strings in conjunction with the Tanimoto Coefficient –This seems to work reasonably well in practice, but can the performance be improved? –Adoption of data fusion techniques to combine rankings from different coefficients

33 Examples Of Different Similarity Coefficients Tanimoto Coefficient Cosine Coefficient Hamming Distance
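
The formulas for this slide did not survive in the transcript. For reference, the standard binary forms of the three named coefficients are given below, assuming molecules A and B are fragment bit-strings with a bits set in A, b bits set in B, and c bits set in both; the notation is an assumption and may differ from the original slide.

```latex
\begin{align*}
  S_{\text{Tanimoto}}(A,B) &= \frac{c}{a + b - c} \\
  S_{\text{Cosine}}(A,B)   &= \frac{c}{\sqrt{a\,b}} \\
  D_{\text{Hamming}}(A,B)  &= a + b - 2c
\end{align*}
```

The first two are similarity measures that reach 1 for identical bit-strings, whereas the Hamming distance is a dissimilarity that reaches 0.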

34 Comparison Of Similarity Coefficients Select a target structure and rank the database in decreasing order of one of 22 different similarity coefficients Look at the top-50 (or whatever) structures for each of the searches and hence calculate the inter-coefficient similarity Cluster the resulting similarity matrix to find coefficients that cluster together Analysis of the resulting data (60 different target structures; 3 different clustering methods; top-50, top-100, top-200 and top-400 structures) shows that many of the coefficients are near-monotonic, i.e., they give nearly identical rankings –Only need to consider 12 of the 22 coefficients

35 Data Fusion Improved performance can be obtained in many classification tasks by combining evidence from several different sources Originally developed for signal processing but applied in textual information retrieval –Calculate the similarity between a user query and each of the documents in a database –Rank the documents in order of decreasing similarity –Repeat using several different representations, coefficients, etc. –Add the rank positions for a given document to give an overall fused rank position –The resulting fused ranking is the output from the search –Small, but consistent, improvements in performance over use of a single ranking
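
The rank-sum procedure listed above can be sketched as follows, assuming every search scores the same set of items and that ties are broken alphabetically (both are assumptions made for the example). The function names and scores are made up.

```python
from typing import Dict, List


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank items 1..n in order of decreasing similarity score."""
    ordered = sorted(scores, key=lambda item: (-scores[item], item))
    return {item: pos for pos, item in enumerate(ordered, start=1)}


def fuse_rankings(score_lists: List[Dict[str, float]]) -> List[str]:
    """Sum each item's rank position over all searches and re-rank
    by increasing total (lowest total = best fused position)."""
    totals: Dict[str, int] = {}
    for scores in score_lists:
        for item, pos in rank_positions(scores).items():
            totals[item] = totals.get(item, 0) + pos
    return sorted(totals, key=lambda item: (totals[item], item))


# Three searches of the same three items with different coefficients.
search_1 = {"A": 0.9, "B": 0.5, "C": 0.7}   # ranking: A, C, B
search_2 = {"A": 0.6, "B": 0.8, "C": 0.7}   # ranking: B, C, A
search_3 = {"A": 0.8, "B": 0.2, "C": 0.9}   # ranking: C, A, B
fused = fuse_rankings([search_1, search_2, search_3])
```

Item C is never worse than second in any single search, so its rank total (5) beats A (6) and B (7), even though A and B each top one list.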

36 Fusion Of Chemical Similarity Coefficients Use of the NCI AIDS database –37K structures that have been tested for anti-AIDS activity, with 300 strongly active –20 of the actives chosen as target structures –12 different similarity coefficients (based on previous analysis of 22 different ones) Searches were carried out using each of the 12 coefficients, and the resulting rankings fused to give rankings corresponding to all combinations of 1, 2, 3…10, 11, 12 coefficients The effectiveness of a combination was evaluated by the number of actives in the top-400 positions of the fused ranking

37 Results Of Fused Searches

38 Conclusions Fusion of rankings can provide a small, but consistent, improvement in the effectiveness of searching if an appropriate combination of coefficients is chosen The principal computational cost of a similarity search is determining the numbers of bits in common to a pair of fragment bit-strings; once this number is known, calculation of the actual coefficient is trivial Data fusion of this sort provides a simple, but highly cost-effective, way of enhancing existing systems for chemical similarity searching
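
The cost argument can be illustrated with a short sketch: the bit counts are computed once per pair of bit-strings, and each coefficient is then a line of arithmetic. Bit-strings are assumed to be held as Python integers, and the three coefficients shown are standard binary forms chosen for illustration rather than the full set of 12 used in the study.

```python
import math
from typing import Dict, Tuple


def bit_counts(x: int, y: int) -> Tuple[int, int, int]:
    """Return (a, b, c): bits set in x, bits set in y, bits set in both.
    This is the expensive step for long bit-strings and large databases."""
    return bin(x).count("1"), bin(y).count("1"), bin(x & y).count("1")


def all_coefficients(x: int, y: int) -> Dict[str, float]:
    """Derive several coefficients from one set of counts."""
    a, b, c = bit_counts(x, y)
    return {
        "tanimoto": c / (a + b - c) if (a + b - c) else 0.0,
        "cosine": c / math.sqrt(a * b) if (a and b) else 0.0,
        "hamming": float(a + b - 2 * c),
    }


# Two toy 4-bit fragment bit-strings sharing two set bits.
result = all_coefficients(0b1110, 0b0111)
```

Because the counts are shared, producing several rankings for fusion costs little more than producing one.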

39 Education and Training To Meet Industrial Needs In Chemoinformatics Detailed technical skills for a relatively small number of people –Perhaps a score of people a year become available world-wide from a very limited number of research groups in Germany, the USA and the UK –Other chemistry PhDs who do a fair amount of computing (e.g., crystallography, computational chemistry) where people can be taught the appropriate skills on the job Base-level awareness of the technologies available –In-house training and short courses organised by database and software suppliers (e.g., CAS and MDLI) –Introductory courses, e.g., Molecular Graphics and Modelling Society course at York –General undergraduate modules, e.g., chemical information or molecular modelling

40 Educational Programmes In Chemoinformatics Both UG and PG courses in computational chemistry are quite common, but no dedicated chemoinformatics training till recently (cf first MSc in bioinformatics started at Manchester in 1994) Sudden burst of interest –PG courses in Chemoinformatics (Sheffield 2000) and ChemInformatics (UMIST 2001) –Both UG and PG courses (Indiana 2001) –Planning for PG courses at Baylor University (Waco) and Polytechnic University (Brooklyn) –Other developments, e.g., Unilever Centre for Molecular Informatics (Cambridge) and Beilstein chair (Frankfurt)

41 Development Of The Sheffield MSc: I The last few years have seen a rapid growth in the positions available (both publicly and privately) –Now a world-wide lack of staff (see Chemical & Engineering News, New Scientist, Warr paper at August 1999 ACS meeting) The Engineering and Physical Sciences Research Council (EPSRC) is one of the main sources of research funding in the UK, and also supports industry-relevant postgraduate training –Meeting with, and then briefing paper for, EPSRC’s Director of Chemistry (Q4 1998) –EPSRC discussions with chemical and then pharmaceutical companies (Q1/2 1999) –Acceptance of industrial need following round-table meeting in June 1999

42 Development Of The Sheffield MSc: II November 1999 EPSRC call for proposals for new MSc programmes (2001-06) included Chemoinformatics as a priority area After much hassle and delay, we received the go-ahead for a September 2000 start in mid-July 2000, with funding (ca. $800K) to cover –Tuition fees –Substantial part of student maintenance costs –Small contribution to teaching costs –Publicity, equipment and odds-and-ends

43 Principal Characteristics Of The Sheffield MSc Collaborative nature –Teaching input principally from Information Studies (and hence -informatics component paramount) –Substantial input from Computer Science and from Chemistry; also input from Automatic Control and Systems Engineering and from Molecular Biology and Biotechnology –Consortium of 13 companies (agrochemical, database, pharmaceutical and software) A three-part MSc programme with the first two based in Sheffield and with the dissertation then being carried out on-site at a company (who fund and supervise the student during this period)

44 Components Of The Sheffield MSc Semester-I –Introduction to programming (Java) –Information systems analysis and design –Information retrieval –Chemoinformatics-I (principles plus introduction to bioinformatics) Semester-II –Software engineering (Java) –Molecular modelling –Chemoinformatics-II (applications plus data mining methods) –Multimedia or database design or advanced information retrieval Semester-III –Dissertation project

45 Components Of The UMIST MSc Database design and programming Computer-aided drug and molecular design Chemical information sources Cheminformatics applications Combinatorial chemistry Spectroscopy and drug discovery Management and intellectual property Molecular modelling Research dissertation

46 Acknowledgements Protein structure matching –Peter Artymiuk, Helen Grindley, Biotechnology and Biological Sciences Research Council Fusion of similarity coefficients –John Holliday, Chung-Wu Hu MSc course –Val Gillet, John Holliday, Engineering and Physical Sciences Research Council

