Genome Annotation of Protein Function using Structural Data: Catalytic Residue Information Janet Thornton European Bioinformatics Institute ISMB/ECCB 2004.

Genome Annotation of Protein Function using Structural Data: Catalytic Residue Information Janet Thornton European Bioinformatics Institute ISMB/ECCB 2004 Glasgow

From Structure to Functional Annotation

From Structure To Biochemical Function
Gene → Protein → 3D Structure → Function
Given a protein structure:
- Where is the functional site?
- What is the multimeric state of the protein?
- Which ligands bind to the protein?
- What is the biochemical function?

Automated Structure Comparison
- The most powerful method for assigning function from structure is global or partial 3D structure comparison (e.g. Dali, SSAP, SSM)
- Hidden Markov Models derived from structural domains can often recognise distant relatives from sequence

Predicting Binding Site
Binding-site analysis: cutA
Figure labels: most likely binding site; surface clefts; residue conservation; conserved surface patches

Identifying Binding Site Function Using Motifs
- 3D enzyme active site structural motifs (Craig Porter)
- Catalytic Site Atlas
- Identification of catalytic residues (Gail Bartlett, Alex Gutteridge)
- Metal binding sites (Malcolm MacArthur)
- Binding site features (Gareth Stockwell)
- Automatically generated templates of ligand-binding and DNA-binding motifs (Sue Jones, Hugh Shanahan)
- Reverse templates (Roman Laskowski)
JESS: fast template search algorithm (Jonathan Barker)

Using information on Catalytic Residues derived from Structures
- Catalytic Site Atlas
- Using info for annotation of enzymes in genomes
- 3D Templates

The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data.
Craig T. Porter, Gail J. Bartlett, and Janet M. Thornton
Nucl. Acids Res. (2004) 32: D129-D133.

Catalytic Site Information
Enzyme reports from primary literature information
β-lactamase Class A
EC:
PDB: 1btl
Reaction: β-lactam + H2O → β-amino acid
Active site residues: S70, K73, S130, E166
Plausible mechanism:

- Annotates catalytic residues in the PDB
- Based on a dataset of 514 enzyme families
  - Representative catalytic site for each family
  - Homologues assigned by PSI-BLAST
  - Limited substitution allowed
  - Homologues updated monthly
- Literature references
- Data also available via MSDsite

CSA Coverage
512 Representative Sites; 9075 PDB Files; Catalytic Sites

Class                   In CSA / In PDB
E.C Oxidoreductases     194 / 271
E.C Transferases        151 / 280
E.C Hydrolases          221 / 421
E.C Lyases               96 / 122
E.C Isomerases           44 / 63
E.C Ligases              33 / 58
Total                   739 / 1215
(Current 512 Enzyme Dataset)

Metal Site Atlas
- Annotates Metal Sites in PDB
- Similar to CSA database
- Searchable by:
  - PDB code
  - Swiss-Prot code
  - Homologues
- Dataset includes:
  - Copper, Zinc, Calcium, Iron (excl. hemes), Cobalt, Magnesium, Manganese, Molybdenum, Nickel and Tungsten

Metal Site Atlas Contents
Templates:
Cu   46
Zn  195
Ca  270
Fe   83
Co    6
Mg   86
Mn   45
Mo   10
Ni    7
W     4
Total Templates: 752
Sites in MSA: 6301 PDB Files; Metal Binding Sites

Comparison of CSA v1.0 with Swiss-Prot and PDB Site Annotations

EC Wheels
Panels: CSA v1.0 (literature); CSA v1.0 plus homologues

iCSA: Using Functional Residue Conservation to Improve Function Annotation
- Starting with over 500 enzymes from the CSA, with EC numbers and high quality catalytic site information
- Retrieve homologues from Biopendium™
- Align homologues with query enzyme, using
  - PSI-BLAST profiles
  - CLUSTAL W multiple alignments
  - Smith-Waterman pairwise alignments
- Check for conservation of catalytic residues
- If all residues are conserved, assign EC from annotated enzyme to homologue
  - Also deals with mutation, etc. if necessary
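The conservation check in the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the iCSA implementation: the function names, the gapped-string alignment representation, the `allowed` substitution table and the toy sequences are all hypothetical, and the three-digit EC `3.4.21` is used only as an example label.

```python
def map_query_positions(aligned_query, aligned_hit):
    """Map 1-based query residue numbers to the hit residue aligned to them.

    Both arguments are rows of one pairwise alignment ('-' marks a gap).
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    query_pos = 0
    for q_char, h_char in zip(aligned_query, aligned_hit):
        if q_char != "-":
            query_pos += 1
            mapping[query_pos] = h_char  # '-' if the hit has a gap here
    return mapping


def transfer_ec(aligned_query, aligned_hit, catalytic_sites, query_ec, allowed=None):
    """Return query_ec if every catalytic residue is conserved in the hit, else None.

    catalytic_sites: {query position: expected residue}.
    allowed: optional {query position: tolerated substitutions} (an assumption
    standing in for the slide's "deals with mutation" step).
    """
    allowed = allowed or {}
    mapping = map_query_positions(aligned_query, aligned_hit)
    for pos, residue in catalytic_sites.items():
        acceptable = {residue} | set(allowed.get(pos, ()))
        if mapping.get(pos) not in acceptable:
            return None  # catalytic residue missing: do not transfer the EC
    return query_ec


# Toy alignment: query positions 3, 5 and 6 are the catalytic residues.
sites = {3: "S", 5: "H", 6: "D"}
print(transfer_ec("MKSAH-DG", "MRSAHLDG", sites, "3.4.21"))  # 3.4.21
print(transfer_ec("MKSAH-DG", "MRAAHLDG", sites, "3.4.21"))  # None
```

A homologue that aligns well overall but lacks one catalytic residue is left without an EC, which is the behaviour the slide describes for non-enzyme relatives.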

Testing the iCSA Method
- Searches with 517 CSA sites retrieved over Swiss-Prot sequences within four iterations of PSI-BLAST
- These were assigned three-digit EC numbers using the iCSA method
- The assigned EC numbers were then compared with the EC annotation given in the Swiss-Prot database
- The accuracy of EC assignment was compared with the accuracy achieved using sequence homology (i.e. PSI-BLAST)
Diagram: CSA query enzyme → homology search → Swiss-Prot homologues → iCSA filter → iCSA filtered homologues. Function assignment by homology uses the Swiss-Prot homologues; function assignment using CSA uses the iCSA filtered homologues.

EC Assignment Accuracy (CSA)
Chart: accuracy = correct EC assigned / an EC assigned

Improvement in EC Assignment Accuracy, Compared with Homology Alone
$$\text{Improvement} = \frac{\text{Accuracy}_{\text{iCSA}} - \text{Accuracy}_{\text{Homology}}}{\text{Accuracy}_{\text{Homology}}}$$
48% overall

iCSA vs. Sequence Homology Alone
- The accuracy of EC assignment is improved by using iCSA
  - The improvement in accuracy is more pronounced with more distant homologues: from 7% at iteration 1 to 88% at iteration 4
  - Overall, EC assignment accuracy is improved by 48%
  - Overall, EC assignment accuracy using iCSA is 86% (vs. 58% using sequence homology alone)
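The 48% figure is consistent with a relative improvement over the homology baseline rather than a difference in percentage points. A quick check in Python, using only the two overall accuracies quoted on the slide (the function name is illustrative):

```python
def relative_improvement(new_accuracy, baseline_accuracy):
    """Relative gain of new_accuracy over baseline_accuracy, as a fraction."""
    return (new_accuracy - baseline_accuracy) / baseline_accuracy


icsa, homology = 0.86, 0.58  # overall accuracies quoted on the slide
gain = relative_improvement(icsa, homology)
print(f"relative improvement: {gain:.0%}")                             # 48%
print(f"absolute difference: {(icsa - homology) * 100:.0f} points")    # 28 points
```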

iCSA EC Coverage
Chart: % coverage (correct EC assigned / homologues with correct EC) against PSI-BLAST iteration

iCSA vs. Sequence Homology Alone
- iCSA coverage is 78% overall
  - The iCSA is right to reject many of these homologues even though they have the same EC as the CSA site used as the query
    - EC covered by more than one specific catalytic site
    - Incorrect EC assignment in Swiss-Prot
  - But misaligned sequences are also possible, especially with more distant homologues

iCSA Correctly Rejects Homologues
- The iCSA accuracy with the CSA trypsin site is 100%
- The benefits of the iCSA method can be seen in the homologues not assigned the trypsin EC
- Trypsin homologues that do not pass the catalytic residue checks in iCSA include several haptoglobin proteins
  - Haptoglobin is closely related to trypsin, but is a known non-enzyme
- Sequence homology alone would assign these haptoglobin sequences the trypsin EC, but iCSA can correctly identify that the residues for catalysis are not present

Human Genome Annotation
- We applied iCSA to the human ENSEMBL sequence database
- The iCSA directly annotated 2064 sequences with an EC
  - Only 64% of these have an equivalent Swiss-Prot protein
    - Equivalent: at least 90% pairwise sequence identity and a difference in length of less than 10% of the shorter sequence
  - So 743 sequence annotations have been efficiently expanded
- A further 2257 homologues did not have a conserved site and an EC was not assigned
  - 73% of the equivalent Swiss-Prot sequences had an alternative EC number to the iCSA query
  - Homology-based functional assignments in these cases could prove incorrect
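The equivalence criterion and the 743 figure can be reproduced with a short Python sketch. The function and parameter names are assumptions made for illustration; only the thresholds (90% identity, 10% of the shorter length) and the counts (2064 sequences, 64%) come from the slide.

```python
def is_equivalent(identity_percent, len_a, len_b,
                  min_identity_percent=90, max_len_diff_percent=10):
    """Slide criterion: at least 90% pairwise identity and a length
    difference of less than 10% of the shorter sequence."""
    shorter = min(len_a, len_b)
    # Integer arithmetic avoids floating-point surprises at the 10% boundary.
    length_ok = abs(len_a - len_b) * 100 < max_len_diff_percent * shorter
    return identity_percent >= min_identity_percent and length_ok


print(is_equivalent(95, 300, 320))  # True: 20 residues is under 10% of 300
print(is_equivalent(95, 300, 330))  # False: 30 residues is exactly 10%, not less
print(is_equivalent(89, 300, 300))  # False: identity below 90%

# Sequences annotated by iCSA that lack an equivalent Swiss-Prot protein.
annotated = 2064
without_equivalent = round(annotated * (100 - 64) / 100)
print(without_equivalent)  # 743
```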

Summary
- iCSA methodology developed
- Database currently contains:
  - 7013 PDBs (11710 chains)
  - Swiss-Prot sequences
  - 4321 Human ENSEMBL sequences
  - 4227 Mouse ENSEMBL sequences

Poster E-37 Session 1 (Sunday)

3D Templates to Characterise Functional Sites
Template searches:
- 189 enzyme active site templates
- ~600 metal binding site templates

Database of enzyme active site templates: 189 templates
Examples: GARTfase; Cholesterol oxidase; IIAglc histidine kinase; Carbamoylsarcosine amidohydrolase; Dihydrofolate reductase; Ser-His-Asp catalytic triad; …

An example
MCSG structure BioH: unknown function, involved in biotin synthesis in E. coli
- Structure: Rossmann fold, hence many structural homologues
- Expected to be an enzyme
- Sequence contains two Gly-X-Ser-X-Gly motifs typical of acyltransferases and thioesterases

CSA template search
- One very strong hit: Ser-His-Asp catalytic triad of the lipases with rmsd = 0.28 Å (template cut-off is 1.2 Å)
- Experimentally confirmed by hydrolase assays
- Novel carboxylesterase acting on short acyl chain substrates
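To make the rmsd comparison concrete, here is a minimal Python sketch of scoring a template match against the 1.2 Å cut-off. It assumes the matched atoms have already been paired and optimally superposed, which a real template search would also have to do; the coordinates are invented, and treating the cut-off as inclusive is an assumption.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) atom coordinates that are already paired and superposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(coords_a))


def passes_cutoff(value, cutoff=1.2):
    """True if the match is within the template cut-off (in angstroms)."""
    return value <= cutoff


# Invented template atoms and a target whose atoms are displaced by 0.3 A in x.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
target = [(x + 0.3, y, z) for x, y, z in template]
score = rmsd(template, target)
print(round(score, 2), passes_cutoff(score))  # 0.3 True
```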

Generation of 3D Active Site Templates for Enzymes in the Catalytic Site Atlas
Gail J Bartlett*, James W Torrance, Craig T Porter, Jonathan A Barker, Alex Gutteridge, Malcolm W MacArthur, Janet M Thornton
EMBL Outstation - European Bioinformatics Institute (EBI), Hinxton, Cambridge CB10 1SD, UK
* Centre For Bioinformatics, Biochemistry Building, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
Poster Number I76 - Monday
1. Introduction
Structural templates can be used to search protein structures for particular patterns of residues, such as catalytic sites. Structural templates are thus a tool for predicting protein function. There are many methods that employ structural templates, but no reliable template libraries. The Catalytic Site Atlas [1] is a database of catalytic residues within proteins of known structure. This information can be used to create a template library. We hope to use this library to uncover cases of convergent evolution and to predict function from structure.
2. Objectives
- To use the Catalytic Site Atlas to create a library of structural templates representing catalytic sites
- To assess the effectiveness of these templates for identifying proteins with a particular catalytic function
3. Methods
Template generation and analysis of active site geometry: Two types of template were created (atoms used are highlighted in ball form): Cα and Cβ atoms; three functional atoms. Templates within the same homologous enzyme family were superposed and the distribution of RMSDs examined.
Assessing template effectiveness: The Jess template-matching method [2] was used to query all the templates against a non-redundant subset of the PDB. Hits were scored using both RMSD and a statistical significance measure. The effectiveness of hits was measured by comparing scores of hits between relatives with scores from random hits identified in the PDB.
Figure: Superposition of homologous family templates
4. Results
- No correlation between RMSD of template atoms and percentage pairwise sequence identity found within homologous enzyme families
- Majority of RMSD values between templates from homologous family members were below 1Å
- Templates distinguish related enzymes well in most families, with > 75% of relatives having RMSDs better than that of any random match
- Some families showed wide variation of catalytic residue geometry, making prediction difficult
- Templates based on Cα/Cβ atoms performed slightly better than those which used functional atoms
5. A good template - aldolase A
Aldolase A relatives superpose well (below right) and there is a clear separation between these and random hits to PDB (below left).
Figure: Distribution of RMSDs of hits to aldolase template (based on PDB 1ald)
6. A bad template - fructose 1,6-bisphosphatase
It is difficult to construct a sensitive template for fructose 1,6-bisphosphatase because one catalytic residue is on a flexible loop that moves when AMP binds at an allosteric site.
Figure: Distribution of RMSDs of hits to fructose 1,6-bisphosphatase template (based on PDB 1eyi)
Figure labels: open form; closed form; catalytic residues; flexible loop; AMP; loop closed; structures of open form; structures of closed form
7. Conclusions
Structural templates representing catalytic sites effectively distinguish between family members and random hits. The lack of correlation between RMSD and pairwise % sequence identity within families is a result of catalytic residue position being affected not only by evolutionary divergence, but also by factors such as presence or absence of ligand, ligand type, and possible functional variation.
8. References
1. Porter, C.T., Bartlett, G.J., Thornton, J.M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res 32, D129-D133.
2. Barker, J.A., Thornton, J.M. (2003) An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics 19.

Template databases
- HAND CURATED
  - Enzyme active sites (PROCAT): 189 templates
    - Currently being extended
  - Metal-binding sites: 600 templates
- AUTOMATED
  - Ligand-binding sites: 10,000 templates
  - DNA-binding sites: 800 templates

Automatically generated templates
1. Ligand-binding templates
a. For each Het Group in the PDB extract a non-homologous data set of proteins binding that Het Group
b. Identify residues interacting with ligand (via H-bonds or non-bonded contacts)
c. Templates generated from overlapping local groups of 3-residue clusters
d. Gives over 10,000 ligand-binding templates

Automatically generated templates
2. DNA-binding templates
a. Extract a non-homologous data set of DNA/RNA-binding proteins from the PDB
b. Identify residues interacting with DNA/RNA (via H-bonds or non-bonded contacts)
c. Templates generated from overlapping local groups of 3-residue clusters
d. Gives over 800 DNA/RNA-binding templates
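Step c, which is shared by the ligand-binding and DNA-binding procedures, can be illustrated with a small Python sketch. The slides do not say how "local" is defined, so the 8 Å pairwise distance limit, the function name and the residue coordinates below are all assumptions chosen for illustration.

```python
from itertools import combinations
import math


def three_residue_templates(residues, max_distance=8.0):
    """Enumerate overlapping 3-residue clusters from interacting residues.

    residues: {label: (x, y, z)} for residues already found to contact the
    ligand or nucleic acid (step b). A trio is kept when every pair of its
    residues lies within max_distance (an assumed locality criterion).
    """
    templates = []
    for trio in combinations(sorted(residues), 3):
        if all(math.dist(residues[a], residues[b]) <= max_distance
               for a, b in combinations(trio, 2)):
            templates.append(trio)
    return templates


# Invented coordinates: four residues close together, one far away.
contacts = {
    "Arg10": (0.0, 0.0, 0.0),
    "Glu25": (4.0, 0.0, 0.0),
    "Ser40": (0.0, 4.0, 0.0),
    "Tyr55": (4.0, 4.0, 0.0),
    "His90": (20.0, 0.0, 0.0),
}
for template in three_residue_templates(contacts):
    print(template)
```

The four nearby residues yield four overlapping trios, while the distant residue appears in none, which shows how one binding site can contribute several templates.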

Problems with automated template methods
WITH A LARGE NUMBER OF TEMPLATES:
- Too many hits (usually tens, and often hundreds)
- Use of rmsd rarely discriminates true from false positives
- Local distortion in structure may give a large rmsd
- Top hit rarely the correct hit, even in obvious cases

An example
PDB code: 1hsk
UDP-N-acetylenolpyruvoylglucosamine reductase (MURB), E.C
- Contains the 3D template that characterises this enzyme class
- Sequence identity to the template's representative structure (1mbb) is 28%
Template residues: Ser, Arg, Glu

Enzyme active site templates
Hits for 1hsk

Hit    E.C number   Rmsd     Enzyme
1.     E.C          Å        Acyl-CoA dehydrogenase
2.     E.C          Å        Tryptophan synthase α-subunit
3.     E.C          Å        Glycosyl hydrolases, family
4.     E.C          Å        Glycosyl hydrolases, family
5.     E.C          Å        Fructose-bisphosphate aldolase (class I)
…      …            …        …
102.   E.C          Å        UDP-N-acetylmuramate dehydrogenase
…      …            …        …
386.   …            3.94Å    …

Matched residues: Arg, Glu, Ser; rmsd=2.19Å

Comparison of template environments
Template structure: 1mbb; target structure: 1hsk
Similar residues in neighbourhood (Arg, Glu, Ser)

Comparison of template environments
Template structure: 1mbb; target structure: 1hsk
Match to template (Arg, Glu, Ser)

Environment similarity score
- Slices through 10Å sphere centred on template match (template structure 1mbb, target structure 1hsk)
- Score equivalent grid-points using Dayhoff matrix and taking voids into account
- Total similarity score obtained from sum of all grid-point scores
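The grid-point scoring can be sketched as follows. This is only an illustration of summing per-point scores: the matrix values are toy numbers, not real Dayhoff scores, and the treatment of voids (zero for void against void, a fixed penalty for void against residue) and of unlisted residue pairs (zero) is an assumption, since the slide does not specify it.

```python
# Toy substitution scores for illustration only; NOT real Dayhoff values.
TOY_MATRIX = {
    ("R", "R"): 6, ("R", "K"): 3,
    ("E", "E"): 4, ("E", "D"): 3,
    ("S", "S"): 2, ("S", "T"): 1,
}


def pair_score(a, b, matrix, void_penalty=-1):
    """Score one pair of equivalent grid points; None marks a void."""
    if a is None and b is None:
        return 0              # void matches void (assumed neutral)
    if a is None or b is None:
        return void_penalty   # residue against void (assumed penalty)
    return matrix.get((a, b), matrix.get((b, a), 0))


def environment_similarity(template_grid, target_grid, matrix=TOY_MATRIX):
    """Sum the scores of all equivalent grid points.

    Each grid maps a grid index (i, j, k) to a one-letter residue code,
    or None for a void; points absent from a grid are treated as voids.
    """
    points = set(template_grid) | set(target_grid)
    return sum(pair_score(template_grid.get(p), target_grid.get(p), matrix)
               for p in points)


template_grid = {(0, 0, 0): "R", (1, 0, 0): "E", (0, 1, 0): "S", (1, 1, 0): None}
target_grid = {(0, 0, 0): "K", (1, 0, 0): "E", (0, 1, 0): None, (1, 1, 0): None}
print(environment_similarity(template_grid, target_grid))  # 3 + 4 - 1 + 0 = 6
```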

Results for 1hsk

Hit   E.C number   Rmsd   Score   Enzyme
1.    E.C                         UDP-N-acetylmuramate dehydrogenase
2.    E.C                         Chitinase A; chitodextrinase; 1,4-beta-poly-N-acetylglucosaminidase; poly-beta-glucosaminidase
3.    E.C                         Turkey lysozyme
4.    E.C                         Hen lysozyme
5.    E.C                         Aspartylglucosylaminidase
6.    E.C                         Glucan 1,4-alpha-glucosidase

Residue conservation
Rank template hits according to conservation scores of the matched residues

Hit   E.C number   Rmsd   Signif   Enzyme
1.    E.C          Å      98.3%    UDP-N-acetylmuramate dehydrogenase
2.    E.C          Å      98.3%    Penicillin acylase
3.    E.C          Å      98.3%    Topoisomerase Ia/II
4.    E.C          Å      98.3%    Mandelate racemase
5.    E.C          Å      97.8%    Topoisomerase Ia/II
…     …            …      …        …

Residue conservation and cleft proximity
Rank by conservation and proximity to the protein's two largest clefts

Hit   E.C number   Rmsd   Signif   Enzyme
1.    E.C          Å      98.4%    Mandelate racemase
2.    E.C          Å      98.3%    UDP-N-acetylmuramate dehydrogenase
3.    E.C          Å      98.3%    Penicillin acylase
4.    E.C          Å      98.3%    Topoisomerase Ia/II
5.    E.C          Å      97.8%    Topoisomerase Ia/II
…     …            …      …        …

Reverse templates
Figure: 1hsk … 3-residue templates

Comparison of template environments
Template structure: 1mbb; target structure: 1hsk
Identical residues in neighbourhood

Reverse templates
- Typically get templates from a single structure
- Search each template vs PDB (or representative subset)
  - Non-homologous dataset of 2,500 protein chains
  - Focused search (e.g. top DALI hits)
- Locate known PDB entries with closest local similarity
- Program called: the Protein SiteSeer
- Times for search vs 2,500 set:
  - JESS: 30 minutes
  - SiteSeer: 3 hours

ProFunc: function from 3D structure
- Homologous sequences of known function
- Homologous structures of known function
- Functional sequence motifs, e.g. Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]
- Residue conservation analysis
- Binding site identification and analysis
- Enzyme active site 3D-templates
- DNA-, ligand-binding and reverse templates
- HTH-motifs
- Electrostatics
- Surface comparison
- … etc
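A functional sequence motif written in PROSITE syntax, such as the one on the slide, can be scanned for with a regular expression. The converter below is a minimal sketch written for this note, not part of ProFunc; it supports only a subset of the syntax (single residues, `x`, `[...]`, `{...}`, repeat counts, and `<`/`>` anchors), and the test sequences are invented.

```python
import re

# One PROSITE element: optional '<', a core, optional (n) or (n,m), optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]")
print(regex)  # Q.{3}[GE].C[YW].{2}[STAGC]

hit = re.search(regex, "MKQAAAGSCWAASLL")      # invented sequence with the motif
print(hit.start(), hit.group())                # 2 QAAAGSCWAAS
print(re.search(regex, "MKQAAAGSAWAASLL"))     # None: the Cys is replaced by Ala
```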

Acknowledgements
- CSA: Craig Porter, Gail Bartlett, Alex Gutteridge, Malcolm MacArthur (EBI), Neera Borkakoti
- Genome Annotation: Ruth Spriggs, Richard George, Mark Swindells, B. Al-Lazikhani (Inpharmatica)
- ProFunc: Roman Laskowski; James Watson (EBI)