Ioerger Lab – Bioinformatics Research

1 Ioerger Lab – Bioinformatics Research
Pattern recognition/machine learning
  issues of representation
  effect of feature extraction, weighting, and interaction on performance of induction algorithm
Applications in Structural Biology
  molecular basis of biology: protein structures
  predicting structures
  tools for solving structures (X-ray crystallography, NMR)
  stability, folding, packing, motions
  drug design (small-molecule inhibitors)
  large datasets exist – exploit them – find the patterns

2 TEXTAL - Automated Crystallographic Protein Structure Determination Using Pattern Recognition
Principal Investigators:
  Thomas Ioerger (Dept. Computer Science)
  James Sacchettini (Dept. Biochem/Biophys)
Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee, Lalji Kanbi, Reetal Pai & Jacob Smith
Funding:
  National Institutes of Health
  Texas A&M University

3 X-ray crystallography
Most widely used method for protein modeling
Steps:
  Grow crystal
  Collect diffraction data
  Generate electron density map (Fourier transform)
  Interpret map, i.e. infer atomic coordinates
  Refine structure
Model-building
  Currently: crystallographers
  Challenges: noise, resolution
  Goal: automation
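The "generate electron density map (Fourier transform)" step can be illustrated with a one-dimensional toy example. This is only a sketch under simplifying assumptions: point atoms on a 1-D unit cell, and phases that are already known (in a real experiment only amplitudes are measured, so phases must be estimated separately). The function names `structure_factors` and `density` are hypothetical and not part of any crystallography package.

```python
import cmath
import math

def structure_factors(atoms, h_max):
    """F(h) = sum_j f_j * exp(2*pi*i*h*x_j) for h in [-h_max, h_max].

    atoms: list of (fractional position x_j, scattering weight f_j).
    """
    return {
        h: sum(f * cmath.exp(2j * math.pi * h * x) for x, f in atoms)
        for h in range(-h_max, h_max + 1)
    }

def density(factors, n_grid):
    """rho(x) = sum_h F(h) * exp(-2*pi*i*h*x), sampled at x = i / n_grid."""
    grid = []
    for i in range(n_grid):
        x = i / n_grid
        rho = sum(F * cmath.exp(-2j * math.pi * h * x) for h, F in factors.items())
        grid.append(rho.real)  # imaginary part cancels for real-valued density
    return grid

# Two toy "atoms": a light one at x = 0.2 and a heavier one at x = 0.7.
atoms = [(0.2, 1.0), (0.7, 2.0)]
factors = structure_factors(atoms, h_max=10)
rho = density(factors, n_grid=50)

# The strongest density peak should sit on the heavier atom.
peak = max(range(len(rho)), key=rho.__getitem__)
```

Truncating the sum at a smaller `h_max` broadens the peaks, which is a rough analogue of the resolution challenge listed on the slide.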

5 Overview of TEXTAL
Automated model-building program
Can we automate the kind of visual processing of patterns that crystallographers use?
  Intelligent methods to interpret density, despite noise
  Exploit knowledge about typical protein structure
Focus on medium-resolution maps
  optimized for 2.8 Å (actually, … Å is fine)
  typical for MAD data (useful for high-throughput)
  other programs exist for higher-res data (ARP/wARP)
Electron density map (or structure factors) → TEXTAL → Protein model (may need refinement)

6 LOOKUP: model side chains CAPRA: models backbone
Crystal → collect data → Diffraction data → Electron density map
CAPRA: models backbone
  SCALE MAP
  TRACE MAP
  CALCULATE FEATURES
  PREDICT Cα’s
  BUILD CHAINS
  PATCH & STITCH CHAINS
  REFINE CHAINS
→ Model of backbone
LOOKUP: model side chains
→ Model of backbone & side chains
POST-PROCESSING
  SEQUENCE ALIGNMENT
  REAL SPACE REFINEMENT
→ Corrected & refined model

7 F=<1.72,-0.39,1.04,...>
F=<1.58,0.18,1.09,...>
F=<0.90,0.65,-1.40,...>
F=<1.79,-0.43,0.88,...>

8 Examples of Numeric Density Features
Distance from center-of-sphere to center-of-mass
Moments of inertia - relative dispersion along orthogonal axes
Geometric features like “Spoke angles”
Local variance and other statistics
Features are designed to be rotation-invariant, i.e. same values for region in any orientation/frame-of-reference.
TEXTAL uses 19 distinct numeric features to represent the pattern of density in a region, each calculated over 4 different radii, for a total of 76 features.
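To make the rotation-invariance claim concrete, here is a minimal sketch of the first feature in the list, the distance from the sphere center to the density-weighted center of mass. The sampling scheme, the function names, and the toy data are assumptions for illustration; the transcript does not show how TEXTAL actually computes its features.

```python
import math

def center_of_mass_distance(samples, center):
    """Distance from the sphere center to the density-weighted center of mass.

    samples: list of ((x, y, z), density) points inside the sphere.
    """
    total = sum(rho for _, rho in samples)
    com = [sum(p[i] * rho for p, rho in samples) / total for i in range(3)]
    return math.sqrt(sum((com[i] - center[i]) ** 2 for i in range(3)))

def rotate_z(p, angle):
    """Rotate point p about the z axis through the origin."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

# Hypothetical density samples around a sphere centered at the origin.
region = [((1.0, 0.0, 0.0), 2.0), ((0.0, 1.5, 0.0), 1.0), ((0.0, 0.0, -0.5), 0.5)]
center = (0.0, 0.0, 0.0)

before = center_of_mass_distance(region, center)
rotated = [(rotate_z(p, 0.7), rho) for p, rho in region]
after = center_of_mass_distance(rotated, center)
# 'before' and 'after' agree up to floating-point error.
```

Because the feature depends only on a distance, rotating the region about the sphere center leaves it unchanged, which is the property the slide asks of every feature.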

9 The LOOKUP Process
Diagram: region in map to be interpreted → database of known maps → find optimal rotation
Two-step filter:
  1) by features
  2) by density correlation
“2-norm”: weighted Euclidean distance metric for retrieving matches
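The two-step filter can be sketched as follows. The exact distance formula is not preserved in the transcript, so the weighted Euclidean form below is an assumption, as are the dictionary layout of the database and the use of a Pearson correlation for step 2. The rotational search mentioned on the slide is omitted: densities are assumed to be already aligned.

```python
import math

def weighted_distance(x, y, w):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def correlation(a, b):
    """Pearson correlation between two equally sampled density vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def lookup(query_features, query_density, database, weights, k):
    # Step 1: cheap filter, keep the k entries closest in feature space.
    shortlist = sorted(
        database,
        key=lambda e: weighted_distance(query_features, e["features"], weights),
    )[:k]
    # Step 2: expensive filter, pick the shortlisted entry whose density
    # correlates best with the query region.
    return max(shortlist, key=lambda e: correlation(query_density, e["density"]))

# Hypothetical database of three known regions.
database = [
    {"name": "A", "features": [1.70, -0.40, 1.00], "density": [0.1, 0.9, 0.4, 0.2]},
    {"name": "B", "features": [1.60, 0.20, 1.10], "density": [0.2, 0.8, 0.5, 0.1]},
    {"name": "C", "features": [0.90, 0.65, -1.40], "density": [0.9, 0.1, 0.2, 0.8]},
]
best = lookup([1.72, -0.39, 1.04], [0.2, 0.8, 0.5, 0.1], database, [1.0, 1.0, 1.0], k=2)
```

In this toy data, entry A is nearest in feature space, but entry B wins after the density-correlation step, which is why the second filter is worth its extra cost.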

10 SLIDER: Feature-weighting algorithm
Euclidean distance metric used for retrieval: relevant features – good, irrelevant features – bad
Goal: find optimal weight vector $w$ that generates the highest probability of hits (matches) in the top $K$ candidates from the database
Concept of Slider: adjust feature weights so that the most matches are ranked higher than mismatches
Slider Algorithm($w$, $F$, $\{R_i\}$, matches, mismatches)
  choose feature $f \in F$ at random
  for each $\langle R_i, R_j, R_k \rangle$, $R_j \in \mathrm{matches}(R_i)$, $R_k \in \mathrm{mismatches}(R_i)$
    compute cross-over point $\lambda_i$ where $\mathrm{dist}'(R_i,R_j) = \mathrm{dist}'(R_i,R_k)$
    $\mathrm{dist}'(X,Y) = \lambda (X_f - Y_f)^2 + (1-\lambda)\,\mathrm{dist}_{\setminus f}(X,Y)$
  pick $\lambda$ that is the best compromise among the $\lambda_i$: ranks most matches above mismatches
  update weight vector: $w' \leftarrow \mathrm{update}(w, f, \lambda)$, $w'_f = \lambda$
  repeat until convergence
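One Slider step for a single feature can be sketched as below. Each triple is reduced to four numbers: the squared difference on feature f and the distance over the remaining features, for the match and for the mismatch. Setting the two mixed distances equal and solving for the mixing weight gives the cross-over point. How the "best compromise" is chosen is not spelled out in the transcript; evaluating the midpoints between consecutive cross-over points is an assumption made here, and all function names are hypothetical.

```python
def crossover(a_j, b_j, a_k, b_k):
    """Solve lam*a_j + (1-lam)*b_j == lam*a_k + (1-lam)*b_k for lam.

    a_* = squared difference on feature f, b_* = distance over the other
    features; j is the match, k is the mismatch.
    """
    denom = (a_j - a_k) + (b_k - b_j)
    if denom == 0:
        return None  # the two lines are parallel, no cross-over
    return (b_k - b_j) / denom

def count_correct(lam, triples):
    """Number of triples where the match is ranked above the mismatch."""
    return sum(
        1
        for a_j, b_j, a_k, b_k in triples
        if lam * a_j + (1 - lam) * b_j < lam * a_k + (1 - lam) * b_k
    )

def best_lambda(triples):
    # Rankings only change at cross-over points, so test one value per interval.
    cuts = {0.0, 1.0}
    for t in triples:
        lam = crossover(*t)
        if lam is not None and 0.0 < lam < 1.0:
            cuts.add(lam)
    ordered = sorted(cuts)
    candidates = [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]
    return max(candidates, key=lambda lam: count_correct(lam, triples))

# Hypothetical triples: (a_j, b_j, a_k, b_k).
triples = [
    (0.1, 1.0, 0.9, 0.5),  # feature f helps, other features mislead
    (0.2, 0.8, 0.7, 0.6),  # feature f helps
    (0.6, 0.2, 0.3, 0.7),  # feature f misleads, other features help
]
lam = best_lambda(triples)
```

In the full algorithm this step would be repeated over randomly chosen features, writing the chosen value into the weight vector each time, until the weights stop changing.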

11 Quality of TEXTAL models
Typically builds >80% of the protein atoms
Accuracy of coordinates: ~1Å error (RMSD)
Depends on resolution and quality of map
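For reference, the RMSD figure quoted above is a root-mean-square deviation over matched atom positions. The sketch below assumes the model and reference coordinates are already paired atom-by-atom and expressed in the same frame; no superposition step is included, and the coordinates are made up.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between paired (x, y, z) coordinates."""
    if len(model) != len(reference):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(
        (m - r) ** 2
        for atom_m, atom_r in zip(model, reference)
        for m, r in zip(atom_m, atom_r)
    )
    return math.sqrt(squared / len(model))

# Two atoms, each displaced by exactly 1 Å from its reference position.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
error = rmsd(model, reference)
```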

12 Closeup of β-strand (TEXTAL model in green)

13 Deployment
September 2004: Linux and OSX distributions
  Available for download
  40 trial licenses granted so far
June 2002: WebTex
  Until May 2005: TB Structural Genomics Consortium members only
  Recently open to the public
  users upload data; processed on server; can download results
  120 users from 70 institutions in 20 countries
July 2003: Model building component of PHENIX
  Python-based Hierarchical ENvironment for Integrated Xtallography
  Consortium members:
    Lawrence Berkeley National Lab
    University of Cambridge
    Los Alamos National Lab
    Texas A&M University

14 Intelligent Methods for Drug Design
structure-based: given protein structure, predict ligands that might bind active site
other methods: QSAR, high-throughput/combi-chem, manual design using 3D
Virtual Screening
  docking algorithm + large library of chemical structures
  sort compounds by interaction energy
  purchase top-ranked hits and assay in lab
  looking for μM inhibitors (leads that can be refined)
  goal: enrichment to ~5% hit rate
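The ranking and hit-rate steps of a virtual screen can be sketched in a few lines. The compound IDs, energies, and assay results below are invented for illustration, and the sketch assumes that a lower (more negative) interaction energy means a better predicted binder.

```python
def top_ranked(energies, n):
    """Return the n compound IDs with the lowest (most favorable) energy."""
    return sorted(energies, key=energies.get)[:n]

def hit_rate(selected, confirmed_actives):
    """Fraction of selected compounds that were confirmed in the lab assay."""
    return sum(1 for c in selected if c in confirmed_actives) / len(selected)

# Hypothetical docking output: compound ID -> interaction energy.
energies = {
    "cmpd-001": -42.1,
    "cmpd-002": -17.5,
    "cmpd-003": -38.9,
    "cmpd-004": -25.0,
    "cmpd-005": -40.3,
}
picked = top_ranked(energies, 3)

# Hypothetical assay result: only one of the purchased compounds inhibits.
actives = {"cmpd-005"}
rate = hit_rate(picked, actives)
```

With a real library, the hit rate among the purchased top-ranked compounds is the number to compare against the ~5% goal.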

15 Virtual Screening
diversity
ZINC database: ~2.6 million compounds
  purchasable; satisfy Lipinski’s rules
docking algorithms: FlexX, DOCK, GOLD, AutoDock, ICM...
  search for position and conformation of ligand
scoring function
  electrostatic + steric + desolvation
  entropy effects?
major open issues: active site flexibility, charge state, waters, co-factors
works best with co-crystal structures (already bound)

16 Grid at Texas A&M
~1600 computers in student labs on TAMU campus (Open-Access Labs)
  labs shown: West Campus Library, Blocker, Zachary
  typical configuration: 2.8 GHz dual-core Pentium CPUs running Windows XP
GridMP software by United Devices (Austin, TX)
gridmaster.tamu.edu: DOCK binaries + receptor files + 20 ligands at a time
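Splitting a ligand library into work units of 20 ligands each might look like the sketch below. It does not use the GridMP API, which the transcript does not describe; the `work_units` function, the dictionary layout, and the file names are hypothetical.

```python
def work_units(ligands, receptor, batch_size=20):
    """Yield one work unit per batch: the receptor plus up to batch_size ligands."""
    for start in range(0, len(ligands), batch_size):
        yield {"receptor": receptor, "ligands": ligands[start:start + batch_size]}

# Hypothetical library of 45 ligand files docked against one receptor.
library = [f"ligand_{i:04d}.mol2" for i in range(45)]
units = list(work_units(library, receptor="receptor.pdb"))
sizes = [len(u["ligands"]) for u in units]
```

Small batches keep each unit short enough to finish on a lab machine that may be reclaimed by a student at any time, which is one plausible reason for the 20-ligand unit size.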

17 Data Mining of Results
promiscuous binders
clusters of related compounds
patterns of contacts within active site
  hydrogen-bonding interactions
adjust weights of scoring function for unique properties of each site
  open/closed, hydrophobic/charged...
ideas for active site variations
development of pharmacophore search patterns

18 Current Screens in Sacchettini Lab
proteins related to tuberculosis (Mycobacterium)
focus on unique pathways involved in dormancy/starvation
  glyoxylate shunt – slow-growth metabolic pathway
  cell-wall biosynthesis (unique mycolic acid layer in tb.)
  biosynthesis of amino acids/co-factors that humans get from diet
isocitrate lyase
malate synthase
PcaA: mycolic acid cyclopropane synthase
ACPS: acyl-carrier protein synthase
InhA: enoyl-acyl reductase (target of isoniazid)
KasB: fatty-acid synthase
BioA: biotin (co-factor) synthase
PGDH: phosphoglycerate dehydrogenase (serine biosynthesis)
Related proteins in malaria, SARS, shigella

23 Conclusions
Many opportunities for research in Structural Bioinformatics
  large datasets
  significant problems
Provides challenges for machine learning
  drives development of novel methods, especially for dealing with noise, sampling biases, extraction of features...
Requires inherently interdisciplinary approach
  training in biochemistry; knowledge of molecular interactions
  understanding chemical intuition; use of visualization tools
  insights about strengths and limitations of existing methods
Requires collaboration to construct appropriate representations to enable learning algorithms to find patterns
  translate expectations about what is relevant, dependencies, smoothing, sources of noise...

