The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation. Darren A. Natale, Ph.D., Project Manager and Senior Scientist

Presentation transcript:

1 The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation. Darren A. Natale, Ph.D., Project Manager and Senior Scientist, PIR; Research Assistant Professor, GUMC. ICoBiCoBi 2004, Angra dos Reis, RJ, Brazil

2 Major Topics 1) UniProt Overview 2) PIRSF Protein Classification System 3) Family-Driven Protein Annotation

3 UniProt: Universal Protein Resource Central Resource of Protein Sequence and Function International Consortium: PIR, EBI, SIB Unifies PIR-PSD, Swiss-Prot, TrEMBL

4 UniProt Databases UniParc: Comprehensive Sequence Archive with Sequence History UniProt: Knowledgebase with Full Classification and Functional Annotation UniRef: Condensed Reference Databases for Sequence Search

5 UniParc An archive for tracking protein sequences Comprehensive: All published protein sequences Non-Redundant: Merge identical sequence strings Traceable: Versioned, with ‘Active’ or ‘Obsolete’ status tag Concise: no annotation of function, species, tissue, etc. 2.5 million unique entries from 6 million source-database entries
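
The "merge identical sequence strings" idea on this slide can be illustrated with a minimal sketch. This is not the UniParc implementation; the function name `build_archive`, the record layout, and the database and accession labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_archive(source_entries):
    """Group (database, accession, sequence) records by exact sequence string.

    Hypothetical sketch: one archive entry per unique sequence, with every
    source record that carried that sequence listed under it.
    """
    archive = defaultdict(list)
    for database, accession, sequence in source_entries:
        # Normalise case and whitespace so identical strings compare equal.
        key = "".join(sequence.split()).upper()
        archive[key].append((database, accession))
    return dict(archive)


# Made-up records: two sources carry the same sequence, one carries a longer one.
entries = [
    ("DB_A", "A0001", "MKTAYIAKQR"),
    ("DB_B", "B0001", "mktayiakqr"),
    ("DB_B", "B0002", "MKTAYIAKQRQ"),
]
archive = build_archive(entries)
for sequence, sources in archive.items():
    print(sequence, sources)
```

Three source records collapse to two archive entries here, which is the same kind of reduction the slide reports (6 million source entries to 2.5 million unique entries), only at toy scale.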

6 UniProt Knowledgebase. Annotated: Fully manually curated (Swiss-Prot section) and automatically annotated based on family-driven rules (TrEMBL section). Cross-referenced: Links to over 50 external databases (classification, domain, structure, genome, functional, boutique). Non-redundant: Merges into a single record all protein products derived from a given gene in a given species. High Information Content: Isoform Presentation: alternatively spliced forms, proteolytic cleavage, and post-translational modification (each with FTid); Nomenclature: gene/protein names (nomenclature committees); Family Classification and Domain Identification: InterPro and PIRSF; Functional Annotation: function, functional site, developmental stage, catalytic activity, modification, regulation, induction, pathway, tissue specificity, sub-cellular location, disease, process

7 UniProt Report (modified Swiss-Prot “NiceProt” view). Report sections: ID & Accession; Name & Taxon; References; Activity; Pathway; Disease; Cross-Refs

8 UniProt Report (II). Position-specific features: active sites, binding sites, modified residues, sequence variations. Additional info. Expanded detail.

9 UniRef Databases Non-Redundant: Merge sequences and subsequences UniRef100: 100% sequence identity from all species, including sub-fragments Superset of Knowledgebase: Includes splice variants and selected UniParc sources (e.g. EnsEMBL, IPI, and patent data) Optimized: For Faster Searches using Reduced Data Sets UniRef90: 90% sequence identity (36% size reduction) UniRef50: 50% sequence identity (63% size reduction)
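
A rough sketch of what "merge sequences and subsequences" at 100% identity could look like follows. It is an assumption-laden toy, not the UniRef procedure: the function name `merge_subfragments` is hypothetical, sub-fragments are detected by plain substring matching, and the quadratic scan would not scale to a real database.

```python
def merge_subfragments(sequences):
    """Cluster sequences so that exact duplicates and exact sub-fragments
    are merged under the longest sequence that contains them."""
    clusters = {}
    # Visit longest sequences first so representatives exist before fragments.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        for representative in clusters:
            if seq in representative:  # 100% identical over the fragment
                clusters[representative].append(seq)
                break
        else:
            # No existing representative contains this sequence.
            clusters[seq] = [seq]
    return clusters


# Made-up sequences for illustration only.
sequences = ["MKTAYIAKQRQISFVK", "AYIAKQ", "MKTAYIAKQRQISFVK", "GGGSSS"]
clusters = merge_subfragments(sequences)
for representative, members in clusters.items():
    print(representative, members)
```

UniRef90 and UniRef50 would replace the substring test with a sequence-identity threshold, which needs an alignment step that this sketch deliberately leaves out.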

10 UniRef100 Report. 100% sequence identity from all species, including sub-fragments. Splice variants as separate entries.

11 UniRef90/50 Reports. Each cluster has a representative sequence. 90%: Merged sequences likely have the same function. 50%: Phenylalanine hydroxylase & Tryptophan hydroxylase

12 UniProt Web Site Publicly available Dec. 15, 2003 Text/Sequence Searches against UniProt, UniRef, UniParc Links to Useful Tools Download UniProt, UniRefs FAQs and Information User Help/feedback forms

13 The Need for Classification. Problem: Most new protein sequences come from genome sequencing projects, and many have unknown functions. Large-scale functional annotation of these sequences based simply on the BLAST best hit has pitfalls; results are far from perfect. Solution: A highly curated and annotated protein classification system. Facilitates: automatic annotation of sequences based on protein families; systematic correction of annotation errors; name standardization in UniProt; functional predictions for uncharacterized proteins. This all works only if the system is optimized for annotation.

14 Levels of Protein Classification
Level | Example | Similarity | Evolution
Class | α/β | Structural elements | No relationships
Fold | TIM-Barrel | Topology of backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Recent duplication

15 Protein Evolution. Sequence changes: With enough similarity, one can trace back to a common origin. Domain shuffling: What about these?

16 Consequences of Domain Shuffling. [Figure: PIRSF families with different domain architectures built from CM (AroQ type), PDH, PDT, and ACT domains; candidate annotations shown as questions: PDT? CM/PDH? PDH? CM/PDT? CM?] Legend: CM = chorismate mutase; PDH = prephenate dehydrogenase; PDT = prephenate dehydratase; ACT = regulatory domain

17 Whole Protein = Sum of its Parts? Domains shown: Peptidase M22, Acylphosphatase, ZnF, YrdC, ZnF. On the basis of domain composition alone, biological function was predicted to be: ● RNA-binding translation factor ● maturation protease. PIRSF actual function: ● [NiFe]-hydrogenase maturation factor, carbamoyl phosphate-converting enzyme

18 Classification Goals We strive to reconstruct the natural classification of proteins to the fullest possible extent BUT Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity) THUS The further we extend the classification, the finer is the domain structure we need to consider SO We need to compromise between the depth of analysis and protein integrity OR Credit: Dr. Y. Wolf, NCBI

19 Complementary Approaches. Domain Classification: Allows a hierarchy that can trace evolution to the deepest possible level, the last point of traceable homology and common origin. Can usually annotate only general biochemical function. Whole-protein Classification: Cannot build a hierarchy deep along the evolutionary tree because of domain shuffling. Can usually annotate specific biological function (preferred for annotating individual proteins). ✓ Can map domains onto proteins ✓ Can classify proteins even when domains are not defined

20 The Ideal System… Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology Allows for simultaneous use of the whole protein and domain information (domains mapped onto proteins) Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families Expertly curated membership, family name, function, background, etc. Evidence attribution (experimental vs predicted)

21 PIRSF Classification System PIRSF: A network structure from superfamilies to subfamilies Reflects evolutionary relationships of full-length proteins Definitions: Homeomorphic Family: Basic Unit Homologous: Common ancestry, inferred by sequence similarity Homeomorphic: Full-length similarity & common domain architecture Network Structure: Flexible number of levels with varying degrees of sequence conservation; allows multiple parents Advantages: Annotate both general biochemical and specific biological functions Accurate propagation of annotation and development of standardized protein nomenclature and ontology

22 PIRSF Classification System A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
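
The constraints stated on this slide (one homeomorphic family per protein, any number of parents and children per node) can be sketched as a small data structure. The class names, the `kind` labels, and the identifiers `FAM_A`, `DOM_X`, `DOM_Y`, `SUB_A1`, and `PROT_1` are hypothetical; this is an illustration of the stated rules, not the PIRSF database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str  # "homeomorphic_family", "subfamily", or "domain_superfamily"
    parents: set = field(default_factory=set)
    children: set = field(default_factory=set)


class ClassificationNetwork:
    """Toy network: nodes may have several parents, so it is not a strict tree."""

    def __init__(self):
        self.nodes = {}
        self.family_of = {}  # protein accession -> homeomorphic family id

    def add_node(self, node_id, kind):
        self.nodes[node_id] = Node(node_id, kind)

    def link(self, parent_id, child_id):
        self.nodes[parent_id].children.add(child_id)
        self.nodes[child_id].parents.add(parent_id)

    def assign(self, accession, family_id):
        if self.nodes[family_id].kind != "homeomorphic_family":
            raise ValueError("proteins are assigned at the homeomorphic family level")
        current = self.family_of.get(accession)
        if current is not None and current != family_id:
            # A protein may belong to only one homeomorphic family.
            raise ValueError(f"{accession} already belongs to {current}")
        self.family_of[accession] = family_id


network = ClassificationNetwork()
network.add_node("DOM_X", "domain_superfamily")
network.add_node("DOM_Y", "domain_superfamily")
network.add_node("FAM_A", "homeomorphic_family")
network.add_node("FAM_B", "homeomorphic_family")
network.add_node("SUB_A1", "subfamily")
# A two-domain family gets one domain superfamily parent per domain.
network.link("DOM_X", "FAM_A")
network.link("DOM_Y", "FAM_A")
network.link("FAM_A", "SUB_A1")
network.assign("PROT_1", "FAM_A")
print(sorted(network.nodes["FAM_A"].parents))
```

Storing parents as a set rather than a single field is what makes this a network instead of a tree; that is the design choice the slide's "allows multiple parents" wording points to.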

23 Variable Domain Architecture. Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow: 1. Variable number of repeats

24 Variable Domain Architecture. Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow: 2. Presence/absence of auxiliary domains (easily lost or acquired; usually small mobile domains; different versions of domain architecture arising many times)

25 Variable Domain Architecture. Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow: 3. Domain duplication

26 Classification Tool: BlastClust Curator-guided clustering Retrieve all proteins sharing a common domain Single-linkage clustering using BlastClust Fixed-length coverage enforces homeomorphicity Iterative procedure allows tree view
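
To make the clustering step concrete, here is a minimal stand-in for single-linkage clustering with a length-coverage requirement. It does not call BlastClust; it assumes pairwise hits have already been computed, and the function name, the hit tuple layout, and the threshold values (0.3 identity, 0.8 coverage) are assumptions made for this sketch.

```python
def single_linkage(ids, hits, min_identity=0.3, min_coverage=0.8):
    """Single-linkage clustering over precomputed pairwise hits.

    Each hit is (id_a, id_b, identity, coverage_a, coverage_b). Requiring
    coverage on BOTH sequences approximates full-length similarity, so a
    shared domain alone does not link two proteins.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for id_a, id_b, identity, cov_a, cov_b in hits:
        if identity >= min_identity and min(cov_a, cov_b) >= min_coverage:
            parent[find(id_a)] = find(id_b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(sorted(members) for members in clusters.values())


# Made-up hits: P3-P4 share only a region covering 30% of P3.
ids = ["P1", "P2", "P3", "P4"]
hits = [
    ("P1", "P2", 0.45, 0.95, 0.90),
    ("P2", "P3", 0.40, 0.85, 0.92),
    ("P3", "P4", 0.60, 0.30, 0.95),
]
clusters = single_linkage(ids, hits)
print(clusters)
```

P1 and P3 end up together without a direct hit because single linkage chains through P2, while P4 stays apart despite its high identity to P3, since the match does not span both proteins.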

27 PIRSF Family Report (I) Curated family name Description of family Sequence analysis tools Phylogenetic tree and alignment view allows further sequence analysis Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF

28 PIRSF Family Report (II) Integrated value-added information from other databases Mapping to other protein classification databases

29 Family-Driven Protein Annotation. PIRSF Protein Classification provides a platform for UniProt protein annotation. Improve Annotation Quality: annotate biological function of whole proteins; annotate uncharacterized hypothetical proteins (functional predictions helped by newly detected family relationships); correct annotation errors; improve under- or over-annotated proteins. Standardize Protein Names in UniProt. Site annotation.

30 Enhanced Annotations in UniProt
Corrections
UniProt ID | OLD name | NEW (proposed) name | PIRSF
P38678 | Glucan synthase-1 | Cell wall assembly and cell proliferation coordinating protein | PIRSF
Q05632 | Decarboxylase | Probable cobalt-precorrin-6Y C(15)-methyltransferase [decarboxylating] | PIRSF
P72117 | PAO substrain OT684 pyoverdine gene transcriptional regulator PvdS | Thioesterase, type II | PIRSF
Upgraded underannotations
UniProt ID | OLD name | NEW (proposed) name | PIRSF
P37185 | Hydrogenase-2 operon protein hybG | [NiFe]-hydrogenase maturation chaperone | PIRSF
P40360 | Hypothetical 65.6 kDa protein in SMC3-MRPL8 intergenic region | Amino-acid acetyltransferase, fungal type | PIRSF
Q98FY9 | CobT protein | Aerobic cobaltochelatase, CobT subunit | PIRSF
Predicted functions for “hypothetical” proteins
UniProt ID | OLD name | NEW (proposed) name | PIRSF
Q57948 | Hypothetical protein MJ0528 | Predicted [NiFe]-hydrogenase-3-type complex Eha, membrane protein EhaA | PIRSF
Q58527 | Hypothetical protein MJ1127 | Predicted metal-dependent hydrolase | PIRSF
O28300 | Hypothetical protein AF1979 | Predicted nucleotidyltransferase | PIRSF005928

31 Family-Driven Protein Annotation. Objective: Optimize for protein annotation. PIRSF Classification Name: reflects the function when possible; indicates the maximum specificity that still describes the entire group; standardized format; name tags: validated, tentative, predicted, functionally heterogeneous. Hierarchy: subfamilies increase specificity (kinase -> sugar kinase -> hexokinase). Name Rules: define conditions under which names propagate to individual proteins; enable further specificity based on taxonomy or motifs; names adhere to Swiss-Prot conventions (though we may make suggestions for improvement). Site Rules: define conditions under which features propagate to individual proteins.

32 PIR Name Rules. Account for functional variations within one PIRSF, including: lack of active site residues necessary for enzymatic activity; certain activities relevant only to one part of the taxonomic tree; evolutionarily related proteins whose biochemical activities are known to differ. Monitor such variables to ensure accurate propagation. Propagate other properties that describe function: EC, GO terms, misnomer info, pathway. Name Rule types: “Zero” Rule: default rule (only condition is membership in the appropriate family); information is suitable for every member. “Higher-Order” Rule: has requirements in addition to membership; can have multiple rules that may or may not have mutually exclusive conditions.

33 Example Name Rules
Rule ID | Rule Conditions | Propagated Information
PIRNR | PIRSF member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC )
PIRNR | PIRSF member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC )
PIRNR | PIRSF member | Name: ACT domain protein; Misnomer: chorismate mutase
Note the lack of a zero rule for PIRSF000881

34 Name Rule in Action at UniProt. Current: Automatic annotations (AA) are in a separate field; AA only visible from [...]. Future: Automatic name annotations will become the DE line if the DE line will improve as a result; AA will be visible from all consortium-hosted web sites.

35 Name Rule Propagation Pipeline. Affiliation of sequence: homeomorphic family or subfamily (whichever PIRSF is the lowest possible node). Name rule exists? No: nothing to propagate. Yes: does the protein fit the criteria for any higher-order rule? Yes: assign name from Name Rule 1 (or 2, etc.). No: does the PIRSF have a zero rule? Yes: assign name from Name Rule 0. No: nothing to propagate.
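
The decision flow on this slide can be written as a short function. This is a sketch under assumptions, not PIR's pipeline code: `NameRule`, `propagate_name`, the protein dictionary keys, and the identifiers `PIRSF_A` through `PIRSF_D` and `RULE_0` through `RULE_2` are hypothetical. The two thioesterase names are taken from the example name rules slide; "Example family protein" is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NameRule:
    rule_id: str
    name: str
    # None marks a zero rule: family membership is the only condition.
    condition: Optional[Callable[[dict], bool]] = None


def propagate_name(protein, rules_by_pirsf):
    """Return the name to propagate, or None if there is nothing to propagate."""
    rules = rules_by_pirsf.get(protein["pirsf"], [])
    if not rules:
        return None  # no name rule exists for this PIRSF
    # Higher-order rules are tried first.
    for rule in rules:
        if rule.condition is not None and rule.condition(protein):
            return rule.name
    # Fall back to the zero rule, if the PIRSF has one.
    for rule in rules:
        if rule.condition is None:
            return rule.name
    return None


rules_by_pirsf = {
    "PIRSF_A": [
        NameRule("RULE_1", "S-acyl fatty acid synthase thioesterase",
                 lambda p: p["is_vertebrate"]),
        NameRule("RULE_2", "Type II thioesterase",
                 lambda p: not p["is_vertebrate"]),
    ],
    "PIRSF_B": [NameRule("RULE_0", "Example family protein")],
    "PIRSF_D": [NameRule("RULE_1", "Example vertebrate-only name",
                         lambda p: p["is_vertebrate"])],
}

vertebrate = {"pirsf": "PIRSF_A", "is_vertebrate": True}
bacterial = {"pirsf": "PIRSF_A", "is_vertebrate": False}
print(propagate_name(vertebrate, rules_by_pirsf))
print(propagate_name(bacterial, rules_by_pirsf))
```

A family like `PIRSF_D`, with a higher-order rule but no zero rule, returns nothing for proteins that miss the condition, which mirrors the "nothing to propagate" branch of the flow.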

36 PIR Site Rules. Position-Specific Site Features: active sites, binding sites, modified amino acids. Current requirements: at least one PDB structure; experimental data on functional sites: CATRES database (Thornton). Rule Definition: (1) Select template structure. (2) Align PIRSF seed members with structural template. (3) Edit MSA to retain conserved regions covering all site residues. (4) Build Site HMM from concatenated conserved regions.

37 Site Rule Algorithm. Match Rule Conditions: Membership Check (PIRSF HMM threshension), ensures that the annotation is appropriate; Conserved Region Check (site HMM threshold); Site Residue Check (all position-specific residues in HMMAlign). Propagate Information: feature annotation using controlled vocabulary; evidence attribution (experimental vs. computational prediction); attribute sources and strengths of evidence.

38 Match Rule Conditions Only propagate site annotation if all rule conditions are met
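
The "all conditions must be met" logic can be sketched as three sequential checks. This sketch assumes the HMM scores and the alignment to the template have already been computed elsewhere; it does not run HMMER. The function name, dictionary keys, thresholds, positions, and residues are all hypothetical.

```python
def match_site_rule(protein, rule):
    """Return True only if every site-rule condition is satisfied."""
    # 1. Membership check: score against the family HMM.
    if protein["family_score"] < rule["family_threshold"]:
        return False
    # 2. Conserved region check: score against the site HMM.
    if protein["site_region_score"] < rule["site_threshold"]:
        return False
    # 3. Site residue check: every template position must align to an
    #    allowed residue; a gap or missing position fails the check.
    aligned = protein["aligned_residues"]  # template position -> residue
    return all(
        aligned.get(position) in allowed
        for position, allowed in rule["site_residues"].items()
    )


# Made-up rule: two site positions on the template, each with allowed residues.
rule = {
    "family_threshold": 100.0,
    "site_threshold": 20.0,
    "site_residues": {57: {"H"}, 195: {"S", "T"}},
}
good = {"family_score": 250.0, "site_region_score": 35.0,
        "aligned_residues": {57: "H", 195: "S"}}
mutated = {"family_score": 250.0, "site_region_score": 35.0,
           "aligned_residues": {57: "A", 195: "S"}}
print(match_site_rule(good, rule), match_site_rule(mutated, rule))
```

The `mutated` protein is a family member with a well-conserved region but a substituted site residue, so no site annotation is propagated to it.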

39 PIRSF Family Report (III). Defined rules for annotation. Site rules allow precise annotation of features for UniProt proteins within the PIRSF.

40 Site Rules Feed Name Rules. Functional Site rule (tags active site, binding, other residue-specific information) -> Functional Annotation rule (gives name, EC, other activity-specific information). Functional variation within one PIRSF: binding sites with different specificity drive choice of applicable rule to ensure appropriate annotation.

41 PIR Team. Director: Dr. Cathy Wu. Curation team: Dr. Winona Barker; Dr. Darren Natale; Dr. CR Vinayaka; Dr. Zhangzhi Hu; Dr. Anastasia Nikolskaya; Dr. Xianying Wei; Dr. Raja Mazumder; Dr. Sona Vasudevan; Dr. Lai-Su Yeh. Informatics team: Dr. Leslie Arminski; Yongxing Chen, M.S.; Jian Zhang, M.S.; Dr. Hsing-Kuo Hua; Sehee Chung, M.S.; Amar Kalelkar; Dr. Hongzhan Huang; Baris Suzek, M.S. Students: Jorge Castro-Alvear; Vincent Hormoso; Rathi Thiagarajan; Christina Fang; Natalia Petrova. UniProt Collaborators: Dr. Rolf Apweiler (EBI); Dr. Amos Bairoch (SIB)

42 Curator’s Decision Maker