Anastasia Nikolskaya PIR (Protein Information Resource) Georgetown University Medical Center



2 Anastasia Nikolskaya, PIR (Protein Information Resource), Georgetown University Medical Center
www.uniprot.org
http://pir.georgetown.edu/
COMPLEMENTING GENE ONTOLOGY WITH PIRSF CLASSIFICATION-BASED PROTEIN ONTOLOGY

3 Why Protein Classification?
Automatic annotation of protein sequences based on protein families (propagation of annotation)
Systematic correction of annotation errors
Protein name standardization in UniProt
Functional predictions for uncharacterized protein families

4 PIRSF Classification System
PIRSF: A network structure with hierarchies from superfamilies to subfamilies that reflects the evolutionary relationships of full-length proteins
Definitions:
  Basic unit = Homeomorphic Family
  Homologous (common ancestry): inferred by sequence similarity
  Homeomorphic: full-length sequence similarity and common domain architecture
  Network structure: flexible number of levels with varying degrees of sequence conservation
Advantages:
  Annotation of both generic biochemical and specific biological functions
  Accurate propagation of annotation and development of standardized protein nomenclature and ontology

5 Levels of protein classification
Level | Example | Similarity | Evolution
Fold | TIM-Barrel | Topology of folded backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
 | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Origin traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Evolution by recent duplication and loss

6 PIRSF Classification System
A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes.
Each homeomorphic family may have as many domain superfamily parents as its members have domains.
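The assignment rules on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, level labels, and every identifier (FAM_A, FAM_B, DOM_1, DOM_2, P_HYPOTHETICAL) are hypothetical and do not come from PIR software or data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    level: str  # hypothetical labels: "domain_superfamily", "homeomorphic_family", "subfamily"
    parents: set = field(default_factory=set)   # zero or more parent nodes
    children: set = field(default_factory=set)  # zero or more child nodes


class PirsfNetwork:
    """Toy network: nodes may have several parents, so it is not a strict tree."""

    def __init__(self):
        self.nodes = {}
        self.family_of = {}  # protein accession -> its single homeomorphic family

    def add_node(self, node_id, level):
        self.nodes[node_id] = Node(node_id, level)

    def link(self, parent_id, child_id):
        self.nodes[parent_id].children.add(child_id)
        self.nodes[child_id].parents.add(parent_id)

    def assign(self, protein, family_id):
        # A protein may be assigned to only one homeomorphic family.
        if self.nodes[family_id].level != "homeomorphic_family":
            raise ValueError(f"{family_id} is not a homeomorphic family")
        current = self.family_of.get(protein)
        if current is not None and current != family_id:
            raise ValueError(f"{protein} already belongs to {current}")
        self.family_of[protein] = family_id


# Hypothetical example: a two-domain family with one domain superfamily parent per domain.
net = PirsfNetwork()
net.add_node("DOM_1", "domain_superfamily")
net.add_node("DOM_2", "domain_superfamily")
net.add_node("FAM_A", "homeomorphic_family")
net.add_node("FAM_B", "homeomorphic_family")
net.link("DOM_1", "FAM_A")
net.link("DOM_2", "FAM_A")
net.assign("P_HYPOTHETICAL", "FAM_A")
```

Assigning P_HYPOTHETICAL to FAM_B afterwards raises ValueError, which is how the sketch enforces the one-family rule while still allowing multiple parents per family.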

7 PIRSF Classification System
SF500001: stimulates trophoblast migration
SF500002: stimulates proliferation of prostate cancer cells
SF500003: anti-proliferative and pro-apoptotic effects on cancer cells
SF500004: inhibitor of IGF
SF500005: stimulates bone formation
SF500006: inhibitor of IGF-II

8 Creation and curation of PIRSFs
[Workflow diagram; labels as recovered from the slide]
Inputs: UniProt proteins; new proteins; unassigned proteins; orphans
Legend: Automatic Procedure; Computer-assisted Manual Curation
Family stages: Preliminary Homeomorphic Families; Curated Homeomorphic Families; Final Homeomorphic Families
Steps: automatic clustering; automatic placement; add/remove members; merge/split clusters; map domains on families; create hierarchies (superfamilies/subfamilies); name, refs, abstract, domain arch.; build and test HMMs; protein name rule/site rule
Curation levels:
  Computer-generated (uncurated) clusters (36,000 PIRSFs)
  Preliminary curation (5,000 PIRSFs): membership, signature domains
  Full curation (1,300 PIRSFs): family name with evidence tag, description, bibliography

9 PIRSF-Based Protein Annotation in UniProt
Rule-based annotation system using curated PIRSFs
Site Rules (PIRSR): position-specific site features (active sites, binding sites, modified sites, other functional sites)
Name Rules (PIRNR): transfer name from PIRSF to individual proteins (define a subgroup if necessary)
  Protein name (may differ from family name), synonyms, acronyms
  EC
  Misnomers
  GO terms (homeomorphic family-based, propagatable GO annotation)
  Function
UniProt is developing protein name standards and guidelines
Classification of proteins into families provides a convenient and accurate mechanism to propagate curated information to individual protein members
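As a rough illustration of how a name rule might propagate a curated name and GO terms to family members, here is a minimal sketch. The rule layout, the field names, and all identifiers and values are assumptions made for this example; they are not the actual PIRNR format.

```python
def apply_name_rule(rule, protein):
    """Return the annotation a name rule would propagate, or None if it does not apply."""
    # The rule applies only to members of the PIRSF it was written for.
    if protein["pirsf"] != rule["pirsf"]:
        return None
    # Optional extra condition, standing in for "define a subgroup if necessary".
    condition = rule.get("condition")
    if condition is not None and not condition(protein):
        return None
    return {"name": rule["name"], "go_terms": list(rule.get("go_terms", []))}


# Hypothetical rule: members of at least 200 residues receive the curated name.
rule = {
    "pirsf": "PIRSF_HYPOTHETICAL",
    "name": "Hypothetical enzyme",
    "go_terms": ["hypothetical GO term"],
    "condition": lambda p: p["length"] >= 200,
}

member = {"accession": "P_EXAMPLE_1", "pirsf": "PIRSF_HYPOTHETICAL", "length": 250}
fragment = {"accession": "P_EXAMPLE_2", "pirsf": "PIRSF_HYPOTHETICAL", "length": 80}
outsider = {"accession": "P_EXAMPLE_3", "pirsf": "PIRSF_OTHER", "length": 300}

annotated = apply_name_rule(rule, member)
```

In this sketch only `member` is annotated; `fragment` fails the subgroup condition and `outsider` belongs to a different family, so both return None.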

10 PIRSF-Based Protein Ontology
PIRSF family hierarchy is based on evolutionary relationships
Standardized PIRSF family names
Network structure (as a DAG) for the PIRSF family classification system

11 PIRSF to GO Mapping
PIRSF to GO mapping provides a link between GO concepts and protein objects
Mapped 5500 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
Superimpose GO and PIRSF hierarchies
Bidirectional display (GO-centric or PIRSF-centric views)
DynGO viewer (Hongfang Liu, University of Maryland)

12 Protein Ontology Can Complement GO
Expanding a Node
Identification of GO subtrees that need expansion if GO concepts are too broad
  ~67% of curated PIRSF families and subfamilies map to GO leaf nodes
  Among these, 2209 PIRSFs have shared GO leaf nodes (many PIRSFs to 1 GO leaf)
  Example: PIRSF001969 vs PIRSF018239 and PIRSF036495: high- vs low-affinity IGF binding
Identification of missing GO nodes
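The "many PIRSFs to 1 GO leaf" check can be sketched as a grouping step. The three PIRSF identifiers come from the slide; the GO term labels, the toy term hierarchy, and the extra family PIRSF_HYPOTHETICAL are assumptions for illustration only, not the real mapping data.

```python
from collections import defaultdict


def shared_go_leaves(pirsf_to_go, go_children):
    """Return GO leaf terms that more than one PIRSF maps to."""
    by_leaf = defaultdict(set)
    for pirsf, terms in pirsf_to_go.items():
        for term in terms:
            # A leaf is a term with no children in the (toy) hierarchy.
            if not go_children.get(term):
                by_leaf[term].add(pirsf)
    return {term: sorted(ids) for term, ids in by_leaf.items() if len(ids) > 1}


# Toy hierarchy: term label -> child term labels (assumed, not taken from GO files).
go_children = {
    "growth factor binding": ["insulin-like growth factor binding"],
    "insulin-like growth factor binding": [],
}

# Families from the slide mapped to one assumed leaf label, plus a hypothetical
# family mapped to a non-leaf term.
pirsf_to_go = {
    "PIRSF001969": ["insulin-like growth factor binding"],
    "PIRSF018239": ["insulin-like growth factor binding"],
    "PIRSF036495": ["insulin-like growth factor binding"],
    "PIRSF_HYPOTHETICAL": ["growth factor binding"],
}

candidates = shared_go_leaves(pirsf_to_go, go_children)
```

Leaves returned here are candidates for expansion: several functionally distinct families (for example, high- versus low-affinity binders) share a single GO concept that may be too broad.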

13 Protein Ontology Can Complement GO
Identification of missing GO nodes (higher levels)

14 Protein Ontology Can Complement GO
Mechanism to examine the relationships between the three GO ontologies based on the shared annotations at different protein family levels
Example: molecular function “estrogen receptor activity” and biological process “signal transduction”, “estrogen receptor signaling pathway”
Linking Function, Biological Process, and Cellular Component through a Protein Object Based on Protein Annotations
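One way to read "shared annotations" is as co-occurrence counts: terms from different GO ontologies that annotate the same family are linked through that family. The sketch below assumes a simple dictionary layout and uses hypothetical family identifiers (FAM_X, FAM_Y); the term labels are the ones quoted on the slide.

```python
from collections import Counter
from itertools import product


def cross_ontology_links(annotations):
    """Count (molecular function, biological process) pairs sharing a family."""
    counts = Counter()
    for terms in annotations.values():
        functions = terms.get("molecular_function", [])
        processes = terms.get("biological_process", [])
        # Every function/process pair on one family counts as one link.
        for pair in product(functions, processes):
            counts[pair] += 1
    return counts


# Hypothetical family-level annotations using the term labels from the slide.
annotations = {
    "FAM_X": {
        "molecular_function": ["estrogen receptor activity"],
        "biological_process": ["signal transduction", "estrogen receptor signaling pathway"],
    },
    "FAM_Y": {
        "molecular_function": ["estrogen receptor activity"],
        "biological_process": ["estrogen receptor signaling pathway"],
    },
}

links = cross_ontology_links(annotations)
```

The same counting could be repeated at the superfamily, family, and subfamily levels, or extended to cellular component terms, to compare how tightly terms are linked at each level.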

15 PIRSF Protein Classification: a link between GO and protein objects
Annotation Quality
  Annotation of biological function of whole proteins
  Annotation of uncharacterized “hypothetical” proteins
  Correction of annotation errors and underannotations
Standardization of Protein Names
PIRSF to GO mapping provides a link between GO sub-ontologies and protein objects

16 PIRSF-based Protein Ontology Can Complement GO
Identification of GO subtrees that need expansion if GO concepts are too broad
Comprehensive classification of related protein families in PIRSF can help identify missing GO nodes when entire groups of PIRSF superfamilies or families cannot be mapped to existing GO terms
Mechanism to examine the relationships between the three GO ontologies (molecular function, biological process, and cellular component), as well as between GO concepts, based on the shared annotations at different protein family levels

17 Acknowledgements
Hongfang Liu, University of Maryland
Judith Blake, The Jackson Laboratory
Dr. Cathy Wu, Director
Protein Classification team: Dr. Winona Barker, Dr. Lai-Su Yeh, Dr. Anastasia Nikolskaya, Dr. Darren Natale, Dr. Zhangzhi Hu, Dr. Raja Mazumder, Dr. CR Vinayaka, Dr. Xianying Wei, Dr. Sona Vasudevan
Informatics team: Dr. Hongzhan Huang, Baris Suzek, M.S., Sehee Chung, M.S., Dr. Leslie Arminski, Dr. Hsing-Kuo Hua, Yongxing Chen, M.S., Jing Zhang, M.S., Amar Kalelkar
Students: Christina Fang, Vincent Hormoso, Natalia Petrova, Jorge Castro-Alvear
PIR Team: http://pir.georgetown.edu/
UniProt (SwissProt, TrEMBL, PIR): www.uniprot.org

