1 The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation Darren A. Natale, Ph.D. Project Manager and Senior Scientist, PIR Research Assistant Professor, GUMC Icobicobi 2004 Angra Dos Reis, RJ, Brasil

2 Major Topics 1) UniProt Overview 2) PIRSF Protein Classification System 3) Family-Driven Protein Annotation

3 UniProt: Universal Protein Resource Central Resource of Protein Sequence and Function International Consortium: PIR, EBI, SIB Unifies PIR-PSD, Swiss-Prot, TrEMBL http://www.uniprot.org

4 UniProt Databases UniParc: Comprehensive Sequence Archive with Sequence History UniProt: Knowledgebase with Full Classification and Functional Annotation UniRef: Condensed Reference Databases for Sequence Search

5 UniParc An archive for tracking protein sequences Comprehensive: All published protein sequences Non-Redundant: Merge identical sequence strings Traceable: Versioned, with ‘Active’ or ‘Obsolete’ status tag Concise: no annotation of function, species, tissue, etc. 2.5 million unique entries from 6 million source-database entries
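The archive properties listed on this slide (identical strings merged, source records kept with a status tag, no functional annotation) can be illustrated with a minimal sketch. The class names, the accession strings and the sequences are hypothetical, and this is not the actual UniParc implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    """One unique sequence string plus the source records that carry it."""
    sequence: str
    # source accession -> "Active" or "Obsolete" (status tags from the slide)
    sources: dict = field(default_factory=dict)


class SequenceArchive:
    """Toy non-redundant archive keyed on the exact sequence string."""

    def __init__(self) -> None:
        self._by_sequence = {}

    def load(self, accession: str, sequence: str) -> ArchiveEntry:
        # Identical strings from different source databases share one entry.
        key = sequence.upper()
        entry = self._by_sequence.setdefault(key, ArchiveEntry(key))
        entry.sources[accession] = "Active"
        return entry

    def retire(self, accession: str) -> None:
        # A withdrawn source record stays traceable; only its tag changes.
        for entry in self._by_sequence.values():
            if accession in entry.sources:
                entry.sources[accession] = "Obsolete"

    def __len__(self) -> int:
        return len(self._by_sequence)


# Hypothetical accessions and sequences, for illustration only.
archive = SequenceArchive()
shared = archive.load("SRC_A:0001", "MKTAYIAKQR")
archive.load("SRC_B:9001", "MKTAYIAKQR")  # same string from a second source
archive.load("SRC_A:0002", "MSLNEQW")
archive.retire("SRC_B:9001")
```

Three source records collapse into two unique entries here, which is the same kind of reduction the slide reports at full scale (6 million source entries, 2.5 million unique sequences).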

6 UniProt Knowledgebase Annotated: Fully manually curated (Swiss-Prot section) and automatically annotated based on family-driven rules (TrEMBL section). Cross-referenced: Links to over 50 external databases (classification, domain, structure, genome, functional, boutique). Non-redundant: Merges in a single record all protein products derived from a certain gene in a given species. High Information Content: Isoform Presentation: Alternatively Spliced Forms, Proteolytic Cleavage, and Post-Translational Modification (each with FTid); Nomenclature: Gene/Protein Names (Nomenclature Committees); Family Classification and Domain Identification: InterPro and PIRSF; Functional Annotation: Function, Functional Site, Developmental Stage, Catalytic Activity, Modification, Regulation, Induction, Pathway, Tissue Specificity, Sub-cellular Location, Disease, Process

7 UniProt Report ID & Accession Name & Taxon References Activity Pathway Disease Modified Swiss-Prot “NiceProt” view Cross-Refs

8 UniProt Report (II) Position-specific features: Active sites Binding sites Modified residues Sequence variations Additional Info Expanded detail

9 UniRef Databases Non-Redundant: Merge sequences and subsequences UniRef100: 100% sequence identity from all species, including sub-fragments Superset of Knowledgebase: Includes splice variants and selected UniParc sources (e.g. EnsEMBL, IPI, and patent data) Optimized: For Faster Searches using Reduced Data Sets UniRef90: 90% sequence identity (36% size reduction) UniRef50: 50% sequence identity (63% size reduction)
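The UniRef100 idea of merging identical sequences and sub-fragments can be sketched as follows. This is a simplified illustration using exact substring matching on hypothetical sequences; it is not the clustering procedure used to build the real UniRef sets, and the function name is an assumption.

```python
def merge_identical_and_subfragments(seqs: dict) -> dict:
    """Group sequences that are identical to, or contained in, a longer one.

    `seqs` maps an accession to its sequence; the result maps each
    representative accession to the list of accessions merged under it.
    """
    clusters = {}
    # Longest first, so a representative is always seen before its fragments.
    for acc in sorted(seqs, key=lambda a: (-len(seqs[a]), a)):
        for rep in clusters:
            if seqs[acc] in seqs[rep]:  # identical string or sub-fragment
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]  # nothing contains it: new representative
    return clusters


# Hypothetical sequences, for illustration only.
example = {
    "A": "MKTAYIAKQRQISFVK",
    "B": "AYIAKQR",           # sub-fragment of A
    "C": "MKTAYIAKQRQISFVK",  # identical to A
    "D": "MSLNEQW",
}
merged = merge_identical_and_subfragments(example)
```

UniRef90 and UniRef50 relax the criterion from exact containment to 90% and 50% sequence identity, which requires a real alignment-based clustering step that this sketch does not attempt.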

10 UniRef100 Report Splice variants Sub-fragments 100% sequence identity from all species, including sub-fragments Splice Variants as separate entries

11 UniRef90/50 Reports Representative sequence. 90%: Merged sequences likely have the same function. 50%: Phenylalanine hydroxylase & Tryptophan hydroxylase

12 UniProt Web Site Publicly available Dec. 15, 2003 Text/Sequence Searches against UniProt, UniRef, UniParc Links to Useful Tools Download UniProt, UniRefs FAQs and Information User Help/feedback forms http://www.uniprot.org

13 The Need for Classification Problem: Most new protein sequences come from genome sequencing projects; many have unknown functions. Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect. Solution: Highly curated and annotated protein classification system. Facilitates: Automatic annotation of sequences based on protein families; systematic correction of annotation errors; name standardization in UniProt; functional predictions for uncharacterized proteins. This all works only if the system is optimized for annotation.

14 Levels of Protein Classification
Level | Example | Similarity | Evolution
Class | α/β | Structural elements | No relationships
Fold | TIM-Barrel | Topology of backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Recent duplication

15 Protein Evolution With enough similarity, one can trace back to a common origin Sequence changes What about these? Domain shuffling

16 Consequences of Domain Shuffling [Figure: domain architectures of PIRSF001500, PIRSF001501, PIRSF006786, PIRSF001499, PIRSF005547 and PIRSF001424, built from CM (AroQ type), PDH, PDT and ACT domains; candidate labels shown as questions: PDT? CM/PDH? PDH? CM/PDT? CM?] CM = chorismate mutase; PDH = prephenate dehydrogenase; PDT = prephenate dehydratase; ACT = regulatory domain

17 Whole Protein = Sum of its Parts? PIRSF006256 [Figure: domain architecture with domains labeled Peptidase M22, Acylphosphatase, ZnF, YrdC, ZnF] On the basis of domain composition alone, biological function was predicted to be: ● RNA-binding translation factor ● maturation protease. Actual function: ● [NiFe]-hydrogenase maturation factor, carbamoyl phosphate-converting enzyme

18 Classification Goals We strive to reconstruct the natural classification of proteins to the fullest possible extent BUT Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity) THUS The further we extend the classification, the finer is the domain structure we need to consider SO We need to compromise between the depth of analysis and protein integrity OR Credit: Dr. Y. Wolf, NCBI

19 Complementary Approaches Domain Classification: Allows a hierarchy that can trace evolution to the deepest possible level, the last point of traceable homology and common origin; can usually annotate only general biochemical function. Whole-protein Classification: Cannot build a hierarchy deep along the evolutionary tree because of domain shuffling; can usually annotate specific biological function (preferred to annotate individual proteins). Can map domains onto proteins. Can classify proteins even when domains are not defined.

20 The Ideal System… Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology Allows for simultaneous use of the whole protein and domain information (domains mapped onto proteins) Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families Expertly curated membership, family name, function, background, etc. Evidence attribution (experimental vs predicted)

21 PIRSF Classification System PIRSF: A network structure from superfamilies to subfamilies Reflects evolutionary relationships of full-length proteins Definitions: Homeomorphic Family: Basic Unit Homologous: Common ancestry, inferred by sequence similarity Homeomorphic: Full-length similarity & common domain architecture Network Structure: Flexible number of levels with varying degrees of sequence conservation; allows multiple parents Advantages: Annotate both general biochemical and specific biological functions Accurate propagation of annotation and development of standardized protein nomenclature and ontology

22 PIRSF Classification System A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
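The network rules on this slide (one homeomorphic family per protein, any number of parent and child nodes per family) can be sketched as a small data structure. The class names, level labels and node identifiers below are hypothetical and only illustrate the constraints; they do not reflect the PIRSF database schema.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """A node in the classification network."""
    node_id: str
    level: str  # "domain_superfamily", "family" or "subfamily" (assumed labels)
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


def link(parent: Node, child: Node) -> None:
    # A node may collect several parents, so this is a network, not a tree.
    parent.children.append(child)
    child.parents.append(parent)


class Classification:
    def __init__(self) -> None:
        self._family_of = {}

    def assign(self, protein: str, family: Node) -> None:
        if family.level != "family":
            raise ValueError("proteins are assigned to homeomorphic families")
        current = self._family_of.setdefault(protein, family)
        if current is not family:
            raise ValueError(f"{protein} already belongs to {current.node_id}")

    def ancestors(self, protein: str) -> set:
        # Walk every parent path upward from the protein's family.
        seen, stack = set(), list(self._family_of[protein].parents)
        while stack:
            node = stack.pop()
            if node.node_id not in seen:
                seen.add(node.node_id)
                stack.extend(node.parents)
        return seen


# Hypothetical two-domain family with one domain-superfamily parent per domain.
dom_x = Node("DOMSF_X", "domain_superfamily")
dom_y = Node("DOMSF_Y", "domain_superfamily")
fam = Node("FAM_1", "family")
sub = Node("SUBFAM_1A", "subfamily")
other = Node("FAM_2", "family")
link(dom_x, fam)
link(dom_y, fam)
link(fam, sub)

classification = Classification()
classification.assign("PROT_1", fam)
```

Calling `classification.assign("PROT_1", other)` afterwards raises `ValueError`, which is how the sketch enforces the single-family rule.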

23 Variable Domain Architecture Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow: 1. Variable number of repeats

24 Variable Domain Architecture Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow: 2. Presence/absence of auxiliary domains (easily lost or acquired; usually small mobile domains; different versions of domain architecture arising many times)

25 Variable Domain Architecture Domain architecture cannot be strictly followed in every case without making small and meaningless PIRSFs that preclude automatic member addition. Therefore, define a “core” and allow: 3. Domain duplication

26 Classification Tool: BlastClust Curator-guided clustering Retrieve all proteins sharing a common domain Single-linkage clustering using BlastClust Fixed-length coverage enforces homeomorphicity Iterative procedure allows tree view
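Single-linkage clustering with a length-coverage requirement, as named on this slide, can be sketched with a union-find structure. The hit tuples, the 30% identity and 90% coverage cutoffs, and the function name are illustrative assumptions; the sketch does not call BlastClust and does not reproduce its options.

```python
def single_linkage(ids, hits, min_identity=30.0, min_coverage=0.9):
    """Cluster sequences joined by any chain of qualifying pairwise hits.

    Each hit is (id_a, id_b, percent_identity, aligned_length, len_a, len_b).
    Coverage is measured against the longer sequence, so a match confined
    to one shared domain does not link two proteins.
    """
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity, aligned, len_a, len_b in hits:
        coverage = aligned / max(len_a, len_b)
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)  # one qualifying link is enough

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise hits, for illustration only.
hits = [
    ("P1", "P2", 45.0, 290, 300, 310),  # full-length match
    ("P2", "P3", 38.0, 280, 310, 300),  # full-length match
    ("P3", "P4", 60.0, 120, 300, 125),  # covers only part of P3
]
clusters = single_linkage(["P1", "P2", "P3", "P4"], hits)
```

P1 and P3 end up together without a direct hit between them, which is the defining behavior of single linkage; P4 stays apart because its match spans only a fraction of P3.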

27 PIRSF Family Report (I) Curated family name Description of family Sequence analysis tools Phylogenetic tree and alignment view allows further sequence analysis Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF

28 PIRSF Family Report (II) Integrated value-added information from other databases Mapping to other protein classification databases

29 Family-Driven Protein Annotation PIRSF Protein Classification provides a platform for UniProt protein annotation. Improve Annotation Quality: Annotate biological function of whole proteins; annotate uncharacterized hypothetical proteins (functional predictions helped by newly-detected family relationships); correct annotation errors; improve under- or over-annotated proteins. Standardize Protein Names in UniProt. Site annotation.

30 Enhanced Annotations in UniProt
Corrections
UniProt ID | OLD name | NEW (proposed) name | PIRSF
P38678 | Glucan synthase-1 | Cell wall assembly and cell proliferation coordinating protein | PIRSF017023
Q05632 | Decarboxylase | Probable cobalt-precorrin-6Y C(15)-methyltransferase [decarboxylating] | PIRSF019019
P72117 | PAO substrain OT684 pyoverdine gene transcriptional regulator PvdS | Thioesterase, type II | PIRSF000881
Upgraded underannotations
UniProt ID | OLD name | NEW (proposed) name | PIRSF
P37185 | Hydrogenase-2 operon protein hybG | [NiFe]-hydrogenase maturation chaperone | PIRSF005618
P40360 | Hypothetical 65.6 kDa protein in SMC3-MRPL8 intergenic region | Amino-acid acetyltransferase, fungal type | PIRSF007892
Q98FY9 | CobT protein | Aerobic cobaltochelatase, CobT subunit | PIRSF031715
Predicted functions for “hypothetical” proteins
UniProt ID | OLD name | NEW (proposed) name | PIRSF
Q57948 | Hypothetical protein MJ0528 | Predicted [NiFe]-hydrogenase-3-type complex Eha, membrane protein EhaA | PIRSF005019
Q58527 | Hypothetical protein MJ1127 | Predicted metal-dependent hydrolase | PIRSF004961
O28300 | Hypothetical protein AF1979 | Predicted nucleotidyltransferase | PIRSF005928

31 Family-Driven Protein Annotation Objective: Optimize for protein annotation. PIRSF Classification Name: Reflects the function when possible; indicates the maximum specificity that still describes the entire group; standardized format; name tags: validated, tentative, predicted, functionally heterogeneous. Hierarchy: Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase). Name Rules: Define conditions under which names propagate to individual proteins; enable further specificity based on taxonomy or motifs; names adhere to Swiss-Prot conventions (though we may make suggestions for improvement). Site Rules: Define conditions under which features propagate to individual proteins.

32 PIR Name Rules Account for functional variations within one PIRSF, including: lack of active site residues necessary for enzymatic activity; certain activities relevant only to one part of the taxonomic tree; evolutionarily-related proteins whose biochemical activities are known to differ. Monitor such variables to ensure accurate propagation. Propagate other properties that describe function: EC, GO terms, misnomer info, pathway. Name Rule types: “Zero” Rule: default rule (only condition is membership in the appropriate family); information is suitable for every member. “Higher-Order” Rule: has requirements in addition to membership; can have multiple rules that may or may not have mutually exclusive conditions.

33 Example Name Rules
Rule ID | Rule Conditions | Propagated Information
PIRNR000881-1 | PIRSF000881 member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
PIRNR000881-2 | PIRSF000881 member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)
PIRNR025624-0 | PIRSF025624 member | Name: ACT domain protein; Misnomer: chorismate mutase
Note the lack of a zero rule for PIRSF000881

34 Name Rule in Action at UniProt Current: Automatic annotations (AA) are in a separate field AA only visible from www.ebi.uniprot.org Future: Automatic name annotations will become DE line if DE line will improve as a result AA will be visible from all consortium-hosted web sites

35 Name Rule Propagation Pipeline
Affiliation of Sequence: Homeomorphic Family or Subfamily (whichever PIRSF is the lowest possible node)
Name rule exists? No: nothing to propagate. Yes: continue.
Protein fits criteria for any higher-order rule? Yes: assign name from Name Rule 1 (or 2 etc). No: continue.
PIRSF has zero rule? Yes: assign name from Name Rule 0. No: nothing to propagate.
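The decision flow of the propagation pipeline can be sketched in a few lines, using the three example name rules from slide 33 as data. The `NameRule` class, the `RULES` table layout and the protein dictionary keys (`pirsf`, `vertebrate`) are hypothetical constructs for illustration, not the PIR implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NameRule:
    rule_id: str
    name: str
    # None marks a zero rule: membership is the only condition.
    condition: Optional[Callable[[dict], bool]] = None


# Rules from slide 33, keyed by the PIRSF they belong to.
RULES = {
    "PIRSF000881": [
        NameRule("PIRNR000881-1", "S-acyl fatty acid synthase thioesterase",
                 lambda p: p["vertebrate"]),
        NameRule("PIRNR000881-2", "Type II thioesterase",
                 lambda p: not p["vertebrate"]),
    ],
    "PIRSF025624": [NameRule("PIRNR025624-0", "ACT domain protein")],
}


def propagate_name(protein: dict) -> Optional[str]:
    """Return the name to propagate, or None when nothing applies.

    `protein["pirsf"]` is assumed to be the lowest PIRSF node the
    sequence is affiliated with.
    """
    rules = RULES.get(protein["pirsf"])
    if not rules:
        return None  # no name rule exists
    for rule in rules:  # higher-order rules take precedence
        if rule.condition is not None and rule.condition(protein):
            return rule.name
    for rule in rules:  # fall back to the zero rule, if there is one
        if rule.condition is None:
            return rule.name
    return None  # no matching rule and no zero rule


vertebrate_hit = propagate_name({"pirsf": "PIRSF000881", "vertebrate": True})
bacterial_hit = propagate_name({"pirsf": "PIRSF000881", "vertebrate": False})
act_hit = propagate_name({"pirsf": "PIRSF025624", "vertebrate": False})
no_rule = propagate_name({"pirsf": "PIRSF999999", "vertebrate": False})
```

Because PIRSF000881 has no zero rule, a member would receive no name if it matched neither higher-order condition; in this data the two conditions happen to be exhaustive.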

36 PIR Site Rules Position-Specific Site Features: active sites binding sites modified amino acids Current requirements: at least one PDB structure experimental data on functional sites: CATRES database (Thornton) Rule Definition: Select template structure Align PIRSF seed members with structural template Edit MSA to retain conserved regions covering all site residues Build Site HMM from concatenated conserved regions

37 Site Rule Algorithm Match Rule Conditions: Membership Check (PIRSF HMM threshold); Conserved Region Check (site HMM threshold); Site Residue Check (all position-specific residues in HMMAlign). Ensures that the annotation is appropriate. Propagate Information: Feature annotation using controlled vocabulary; evidence attribution (experimental vs. computational prediction); attribute sources and strengths of evidence.

38 Match Rule Conditions Only propagate site annotation if all rule conditions are met
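The all-conditions-must-hold logic can be sketched as three sequential checks. The scores and the alignment mapping are taken as precomputed inputs (the sketch does not run any HMM software), and the class, field names, cutoffs and residue positions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteRule:
    family_cutoff: float  # PIRSF HMM threshold (membership check)
    site_cutoff: float    # site HMM threshold (conserved region check)
    site_residues: dict   # template alignment column -> required residue
    feature: str          # controlled-vocabulary feature label


def apply_site_rule(rule, family_score, site_score, aligned):
    """Return propagated features, or an empty list if any check fails.

    `aligned` maps a template alignment column to a pair
    (position in the target sequence, residue at that position).
    """
    if family_score < rule.family_cutoff:  # membership check
        return []
    if site_score < rule.site_cutoff:      # conserved region check
        return []
    features = []
    for column, required in rule.site_residues.items():
        position, residue = aligned.get(column, (None, None))
        if residue != required:            # site residue check
            return []
        features.append((position, residue, rule.feature,
                         "computational prediction"))
    return features


# Hypothetical rule and alignments, for illustration only.
rule = SiteRule(family_cutoff=100.0, site_cutoff=25.0,
                site_residues={12: "S", 40: "H"}, feature="Active site")
conserved = {12: (57, "S"), 40: (88, "H")}
substituted = {12: (57, "A"), 40: (88, "H")}

annotated = apply_site_rule(rule, 250.0, 40.0, conserved)
rejected = apply_site_rule(rule, 250.0, 40.0, substituted)
```

A single substituted site residue suppresses the whole annotation, which mirrors the slide's requirement that site features propagate only when every rule condition is met.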

39 Defined rules for annotation Site rules allow precise annotation of features for UniProt proteins within the PIRSF PIRSF Family Report (III)

40 Site Rules Feed Name Rules ? Functional variation within one PIRSF: binding sites with different specificity drive choice of applicable rule to ensure appropriate annotation Functional Site rule: tags active site, binding, other residue-specific information Functional Annotation rule: gives name, EC, other activity-specific information

41 PIR Team Dr. Cathy Wu, Director. Curation team: Dr. Winona Barker; Dr. Darren Natale; Dr. CR Vinayaka; Dr. Zhangzhi Hu; Dr. Anastasia Nikolskaya; Dr. Xianying Wei; Dr. Raja Mazumder; Dr. Sona Vasudevan; Dr. Lai-Su Yeh. Informatics team: Dr. Leslie Arminski; Yongxing Chen, M.S.; Jian Zhang, M.S.; Dr. Hsing-Kuo Hua; Sehee Chung, M.S.; Amar Kalelkar; Dr. Hongzhan Huang; Baris Suzek, M.S. Students: Jorge Castro-Alvear; Vincent Hormoso; Rathi Thiagarajan; Christina Fang; Natalia Petrova. UniProt Collaborators: Dr. Rolf Apweiler/EBI; Dr. Amos Bairoch/SIB

42 Curator’s Decision Maker
