Anastasia Nikolskaya PIR (Protein Information Resource) Georgetown University Medical Center

Presentation transcript:

Anastasia Nikolskaya PIR (Protein Information Resource) Georgetown University Medical Center COMPLEMENTING GENE ONTOLOGY WITH PIRSF CLASSIFICATION-BASED PROTEIN ONTOLOGY

2 Why Protein Classification?
Automatic annotation of protein sequences based on protein families (propagation of annotation)
Systematic correction of annotation errors
Protein name standardization in UniProt
Functional predictions for uncharacterized protein families

3 PIRSF Classification System
PIRSF: A network structure with hierarchies from Superfamilies to Subfamilies reflects evolutionary relationships of full-length proteins
Definitions:
  Basic unit = Homeomorphic Family
  Homologous (Common Ancestry): Inferred by sequence similarity
  Homeomorphic: Full-length sequence similarity and common domain architecture
  Network Structure: Flexible number of levels with varying degrees of sequence conservation
Advantages:
  Annotation of both generic biochemical and specific biological functions
  Accurate propagation of annotation and development of standardized protein nomenclature and ontology

4 Levels of protein classification
Level | Example | Similarity | Evolution
Fold | TIM-Barrel | Topology of folded backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Origin traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Evolution by recent duplication and loss

5 PIRSF Classification System A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.

6 PIRSF Classification System
SF500001: stimulates trophoblast migration
SF500002: stimulates proliferation of prostate cancer cells
SF500003: anti-proliferative and pro-apoptotic effects on cancer cells
SF500004: inhibitor of IGF
SF500005: stimulates bone formation
SF500006: inhibitor of IGF-II
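The membership constraints stated on these slides can be sketched as a small data model. This is an illustrative sketch only, not PIR's implementation: the class names, level labels, the `domain_count` field, and the identifiers `DOM1`, `FAM1`, `FAM2`, and `P1` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FamilyNode:
    """One node in a PIRSF-style classification network (hypothetical model)."""
    node_id: str
    level: str             # "domain_superfamily", "homeomorphic_family", or "subfamily"
    domain_count: int = 1  # number of domains in member proteins (assumed field)
    parents: set = field(default_factory=set)
    children: set = field(default_factory=set)


class Classification:
    def __init__(self):
        self.nodes = {}      # node id -> FamilyNode
        self.family_of = {}  # protein accession -> homeomorphic family id

    def add_node(self, node_id, level, domain_count=1):
        self.nodes[node_id] = FamilyNode(node_id, level, domain_count)

    def link(self, parent_id, child_id):
        """Add a parent/child edge; a node may have zero or more of each."""
        parent, child = self.nodes[parent_id], self.nodes[child_id]
        if (parent.level == "domain_superfamily"
                and child.level == "homeomorphic_family"):
            existing = {p for p in child.parents
                        if self.nodes[p].level == "domain_superfamily"}
            # At most as many domain superfamily parents as members have domains.
            if parent_id not in existing and len(existing) >= child.domain_count:
                raise ValueError("too many domain superfamily parents")
        parent.children.add(child_id)
        child.parents.add(parent_id)

    def assign(self, protein, family_id):
        """Assign a protein to exactly one homeomorphic family."""
        if self.nodes[family_id].level != "homeomorphic_family":
            raise ValueError("proteins are assigned to homeomorphic families")
        current = self.family_of.get(protein)
        if current is not None and current != family_id:
            raise ValueError(f"{protein} already belongs to {current}")
        self.family_of[protein] = family_id
```

A second `assign` call that places the same protein in a different family raises `ValueError`, which is the "only one homeomorphic family" rule; parent and child links are otherwise unrestricted, so the structure is a network rather than a strict tree.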

7 Creation and curation of PIRSFs
[Workflow diagram; labels as extracted]
Protein sets: UniProt proteins; Preliminary Homeomorphic Families; Orphans; Curated Homeomorphic Families; Final Homeomorphic Families; Unassigned proteins; New proteins
Procedures: Automatic Procedure; Automatic clustering; Automatic placement; Computer-assisted Manual Curation
Curation actions: Add/remove members; Name, refs, abstract, domain arch.; Create hierarchies (superfamilies/subfamilies); Map domains on Families; Merge/split clusters; Protein name rule/site rule; Build and test HMMs
Curation levels:
  Computer-Generated (Uncurated) Clusters (36,000 PIRSFs)
  Preliminary Curation (5,000 PIRSFs): Membership; Signature Domains
  Full Curation (1,300 PIRSFs): Family Name with evidence tag; Description, Bibliography

8 PIRSF-Based Protein Annotation in UniProt
Rule-based annotation system using curated PIRSFs
Site Rules (PIRSR): Position-specific site features (active sites, binding sites, modified sites, other functional sites)
Name Rules (PIRNR): Transfer name from PIRSF to individual proteins (define a subgroup if necessary)
  Protein name (may differ from family name), synonyms, acronyms
  EC
  Misnomers
  GO terms (homeomorphic family-based, propagatable GO annotation)
  Function
UniProt is developing protein name standards and guidelines
Classification of proteins into families provides a convenient and accurate mechanism to propagate curated information to individual protein members
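As a rough illustration of how a family-level name rule could propagate curated information to members, the sketch below copies a rule's name, EC number, and GO terms onto every protein in the matching family. The record layout, field names, and all identifiers (`PIRNR_EXAMPLE`, `FAM1`, `P1`, and so on) are assumptions for the example and do not reflect the actual PIRNR rule format.

```python
def apply_name_rule(rule, proteins):
    """Propagate a family-level name rule to the proteins in that family.

    `rule` and each protein are plain dicts with hypothetical keys.
    Returns one new annotation record per matching protein.
    """
    annotated = []
    for protein in proteins:
        if protein["family"] != rule["family"]:
            continue  # the rule only applies to members of its own family
        annotated.append({
            "accession": protein["accession"],
            "name": rule["name"],
            "ec": rule.get("ec"),
            "go_terms": list(rule.get("go_terms", [])),
            # Record where the annotation came from so it can be audited.
            "evidence": "rule:" + rule["rule_id"],
        })
    return annotated


# Hypothetical rule and proteins, for illustration only.
example_rule = {
    "rule_id": "PIRNR_EXAMPLE",
    "family": "FAM1",
    "name": "example family protein",
    "ec": None,
    "go_terms": ["GO_TERM_A"],
}
example_proteins = [
    {"accession": "P1", "family": "FAM1"},
    {"accession": "P2", "family": "FAM2"},
    {"accession": "P3", "family": "FAM1"},
]
propagated = apply_name_rule(example_rule, example_proteins)
```

Keeping an evidence tag on each propagated record mirrors the slide's emphasis on traceable, correctable annotation, although the tag format here is invented.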

9 PIRSF-Based Protein Ontology
PIRSF family hierarchy is based on evolutionary relationships
Standardized PIRSF family names
Network structure (in DAG) for PIRSF family classification system

10 PIRSF to GO Mapping
PIRSF to GO mapping provides a link between GO concepts and protein objects
Mapped 5500 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
Superimpose GO and PIRSF hierarchies
Bidirectional display (GO-centric or PIRSF-centric views)
DynGO viewer (Hongfang Liu, University of Maryland)

11 Protein Ontology Can Complement GO
Expanding a Node
Identification of GO subtrees that need expansion if GO concepts are too broad
~67% of curated PIRSF families and subfamilies map to GO leaf nodes
Among these, 2209 PIRSFs have shared GO leaf nodes (many PIRSFs to 1 GO leaf)
Example: PIRSF vs PIRSF and PIRSF : High- vs low-affinity IGF binding
Identification of missing GO nodes
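The "many PIRSFs to 1 GO leaf" check can be sketched as follows: group families by the GO leaf terms they map to and report leaves shared by more than one family, which are candidates for expansion. The input shapes (a family-to-terms mapping and a term-to-children mapping) and every identifier in the toy data are assumptions for illustration, not the mapping format used by PIR or DynGO.

```python
from collections import defaultdict


def shared_go_leaves(pirsf_to_go, go_children):
    """Find GO leaf terms that more than one PIRSF maps to.

    pirsf_to_go: dict of family id -> iterable of GO term ids it maps to.
    go_children: dict of GO term id -> list of child term ids
                 (a term with no children is treated as a leaf).
    Returns a dict of leaf term id -> sorted list of family ids.
    """
    families_by_leaf = defaultdict(set)
    for family, terms in pirsf_to_go.items():
        for term in terms:
            if not go_children.get(term):  # missing or empty means leaf
                families_by_leaf[term].add(family)
    return {term: sorted(families)
            for term, families in families_by_leaf.items()
            if len(families) > 1}


# Toy data with hypothetical identifiers.
toy_children = {"TERM_ROOT": ["TERM_LEAF_A", "TERM_LEAF_B"]}
toy_mapping = {
    "FAM1": ["TERM_LEAF_A"],
    "FAM2": ["TERM_LEAF_A"],
    "FAM3": ["TERM_LEAF_B"],
    "FAM4": ["TERM_ROOT"],
}
candidates = shared_go_leaves(toy_mapping, toy_children)
```

In the toy data only `TERM_LEAF_A` is shared by two families, so it is the single leaf flagged as possibly too broad.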

12 Protein Ontology Can Complement GO Identification of Missing GO Nodes (higher levels)

13 Protein Ontology Can Complement GO
Mechanism to examine the relationships between the three GO ontologies based on the shared annotations at different protein family levels
Example: molecular function "estrogen receptor activity" and biological process "signal transduction", "estrogen receptor signaling pathway"
Linking Function, Biological Process, and Cellular Component through a Protein Object Based on Protein Annotations
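One simple way to examine relationships between GO ontologies through shared annotations is to count how often a term from one ontology is annotated on the same protein as a term from another. The sketch below does this for the member proteins of a family; the annotation layout and the protein identifiers are hypothetical, and only the term names come from the slide's example.

```python
from collections import Counter
from itertools import product


def cross_aspect_pairs(annotations,
                       aspect_a="molecular_function",
                       aspect_b="biological_process"):
    """Count co-annotated term pairs across two GO aspects.

    annotations: dict of protein id -> dict of aspect name -> list of terms.
    Returns a Counter keyed by (term_from_a, term_from_b).
    """
    counts = Counter()
    for by_aspect in annotations.values():
        terms_a = by_aspect.get(aspect_a, ())
        terms_b = by_aspect.get(aspect_b, ())
        for pair in product(terms_a, terms_b):
            counts[pair] += 1
    return counts


# Hypothetical member proteins of one family.
family_annotations = {
    "PROT_1": {
        "molecular_function": ["estrogen receptor activity"],
        "biological_process": ["signal transduction",
                               "estrogen receptor signaling pathway"],
    },
    "PROT_2": {
        "molecular_function": ["estrogen receptor activity"],
        "biological_process": ["signal transduction"],
    },
}
pair_counts = cross_aspect_pairs(family_annotations)
```

Pairs that recur across most members of a family suggest a link between the function term and the process term; a real analysis would also need to account for annotation depth and evidence, which this sketch ignores.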

14 PIRSF Protein Classification: a link between GO and protein objects
Annotation Quality
Annotation of biological function of whole proteins
Annotation of uncharacterized "hypothetical" proteins
Correction of annotation errors and underannotations
Standardization of Protein Names
PIRSF to GO mapping provides a link between GO sub-ontologies and protein objects

15 PIRSF-based Protein Ontology Can Complement GO
Identification of GO subtrees that need expansion if GO concepts are too broad
Comprehensive classification of related protein families in PIRSF can help in identification of missing GO nodes when entire groups of PIRSF superfamilies or families cannot be mapped to existing GO terms
Mechanism to examine the relationships between the three GO ontologies (molecular function, biological process, and cellular component), as well as between GO concepts, based on the shared annotations at different protein family levels

16 Acknowledgements
Hongfang Liu, University of Maryland
Judith Blake, The Jackson Laboratory
Dr. Cathy Wu, Director
Protein Classification team: Dr. Winona Barker, Dr. Lai-Su Yeh, Dr. Anastasia Nikolskaya, Dr. Darren Natale, Dr. Zhangzhi Hu, Dr. Raja Mazumder, Dr. CR Vinayaka, Dr. Xianying Wei, Dr. Sona Vasudevan
Informatics team: Dr. Hongzhan Huang, Baris Suzek, M.S., Sehee Chung, M.S., Dr. Leslie Arminski, Dr. Hsing-Kuo Hua, Yongxing Chen, M.S., Jing Zhang, M.S., Amar Kalelkar
Students: Christina Fang, Vincent Hormoso, Natalia Petrova, Jorge Castro-Alvear
PIR Team
UniProt (SwissProt, TrEMBL, PIR)