Anastasia Nikolskaya PIR (Protein Information Resource), Georgetown University Medical Center FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES: ANNOTATION AND FAMILY CLASSIFICATION

Presentation transcript:

Anastasia Nikolskaya PIR (Protein Information Resource), Georgetown University Medical Center FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES: ANNOTATION AND FAMILY CLASSIFICATION

2 Overview
Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect
Functional Analysis of Protein Sequences:
- Homology-based (sequence analysis, structure analysis)
- Non-homology (genome context, phylogenetic distribution)
Solution for Large-scale Annotation:
- Highly curated and annotated protein classification system
- Automatic annotation of sequences based on protein families
PIRSF Protein Classification System:
- Full-length protein family classification based on evolution
- Highly annotated, optimized for annotation propagation
- Functional predictions for uncharacterized proteins
- Used to facilitate and standardize annotations in UniProt

3 Proteomics and Bioinformatics
Data:
- Genome projects (sequencing)
- Gene expression profiling: genome-wide analysis of gene expression
- Protein-protein interaction
- Structural genomics: 3D structures of all protein families
- ….
Bioinformatics:
- Computational analysis and integration of these data
- Making predictions (function, etc.), reconstructing pathways

4 What’s In It For Me?
Sequence -> function
- When an experiment yields a sequence (or a set of sequences), we need to find out as much as we can about the protein and its possible function from available data
- This is especially important for poorly characterized or uncharacterized (“hypothetical”) proteins
- It is more challenging for the large sets of sequences generated by large-scale proteomics experiments
- The quality of this assessment is often critical for interpreting experimental results and formulating hypotheses for future experiments

5 Work with Protein, not DNA Sequence
[Figure: from genomic DNA sequence to function. Labels: genomic DNA sequence (promoter, 5' UTR, exons 1-3, introns, 3' UTR); gene recognition; protein sequence; structure determination; protein structure; family classification; protein family; molecular evolution; function analysis; gene network; metabolic pathway]

6 The Changing Face of Protein Science
20th century:
- Few well-studied proteins
- Mostly globular with enzymatic activity
- Biased protein set
21st century:
- Many “hypothetical” proteins (most new proteins come from genome sequencing projects, many have unknown functions)
- Various, often with no enzymatic activity
- Natural protein set
Credit: Dr. M. Galperin, NCBI

7 Knowing the Complete Genome Sequence
Advantages:
- All encoded proteins can be predicted and identified
- The missing functions can be identified and analyzed
- Peculiarities and novelties in each organism can be studied
- Predictions can be made and verified
Challenge:
- Accurate assignment of known or predicted functions (functional annotation)

8 [Figure: E. coli, M. jannaschii, S. cerevisiae and H. sapiens compared by annotation category: characterized experimentally; characterized by similarity; unknown, conserved; unknown, no similarity. From Koonin and Galperin, 2003, with modifications]

9 Functional Annotation for Different Groups of Proteins
- Experimentally characterized: find up-to-date information, accurate interpretation
- Characterized by similarity (“knowns”) = closely related to experimentally characterized: avoid propagation of errors
- Function can be predicted (no close sequence similarity, may be distant similarity to characterized proteins): extract maximum possible information, avoid errors and overpredictions; most value-added (fill the gaps in metabolic pathways, etc.)
- “Unknowns” (conserved or unique): rank by importance

10 How are Protein Sequences Annotated? (“regular approach”)
Protein sequence -> function
- Automatic assignment based on sequence similarity (best BLAST hit): gene name, protein name, function
- Large-scale functional annotation of sequences based simply on BLAST best hit has pitfalls; results are far from perfect
- To avoid mistakes, human intervention is needed (manual annotation)
- Quality vs Quantity

12 Problems in Functional Assignments for “Knowns”
- Misinterpreted experimental results (e.g. suppressors, cofactors)
- Biologically senseless annotations:
  - Arabidopsis: separation anxiety protein-like
  - Helicobacter: brute force protein
  - Methanococcus: centromere-binding protein
  - Plasmodium: frameshift
- “Goofy” mistakes of sequence comparison (e.g. abc1/ABC)
- Multi-domain organization of proteins
- Low sequence complexity (coiled-coil, transmembrane, non-globular regions)
- Enzyme evolution:
  - Divergence in sequence and function (minor mutation in active site)
  - Non-orthologous gene displacement: convergent evolution

13 Problems in Functional Assignments for “Knowns”: multi-domain organization of proteins
[Figure: BLAST alignment of a new sequence (ACT domain) with a chorismate mutase (chorismate mutase domain + ACT domain)]
- In BLAST output, top hits are to chorismate mutases
- The name “chorismate mutase” is automatically assigned to the new sequence
- ERROR! The protein gets an erroneous name and EC number, is assigned to an erroneous pathway, etc.
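One common guard against this kind of error is to require that the alignment covers most of both sequences before a name is transferred. The sketch below is only an illustration of that idea, not a method from the slides: the `Hit` fields, the 0.8 coverage cutoff and the sequence lengths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment between a query and a database sequence."""
    subject_name: str
    query_len: int    # full length of the new sequence
    subject_len: int  # full length of the database sequence
    q_start: int      # 1-based, inclusive alignment bounds on the query
    q_end: int
    s_start: int      # 1-based, inclusive alignment bounds on the subject
    s_end: int


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence covered by the aligned region."""
    return (end - start + 1) / length


def name_is_transferable(hit: Hit, min_cov: float = 0.8) -> bool:
    """Allow name transfer only if the alignment spans most of BOTH sequences.

    min_cov is an assumed cutoff chosen for illustration.
    """
    query_cov = coverage(hit.q_start, hit.q_end, hit.query_len)
    subject_cov = coverage(hit.s_start, hit.s_end, hit.subject_len)
    return query_cov >= min_cov and subject_cov >= min_cov


# Hypothetical numbers: a 90-residue ACT-only protein aligned to the ACT
# domain at the end of a 360-residue chorismate mutase.
act_only = Hit("chorismate mutase", 90, 360, 1, 90, 271, 360)
# Hypothetical full-length match between two proteins of similar size.
full_length = Hit("chorismate mutase", 350, 360, 1, 350, 5, 358)

print(name_is_transferable(act_only))     # subject coverage is only 0.25
print(name_is_transferable(full_length))
```

The ACT-only query is fully covered by the alignment, but only a quarter of the chorismate mutase is, so the name is not transferred. Checking both directions is what catches the shared-domain case.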

14 Problems in Functional Assignments for “Knowns”: previous low-quality annotations lead to propagation of mistakes

16 Functional Prediction: I. Sequence and Structure Analysis (homology-based methods)
In non-obvious cases:
- Sophisticated database searches (PSI-BLAST, HMM)
- Detailed manual analysis of sequence similarities
- Structure-guided alignments and structure analysis
Often, only general function can be predicted:
- Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)
- Helix-turn-helix motif proteins (predicted transcriptional regulators)
- Membrane transporters

17 Using Sequence Analysis: Hints
- Proteins (domains) with different 3D folds are not homologous (unrelated by origin)
- Proteins with similar 3D folds are usually (but not always) homologous
- Amino acids that are conserved in divergent proteins within a (super)family are likely to be functionally important (catalytic or binding sites, etc.)
- Reaction chemistry often remains conserved even when sequence diverges almost beyond recognition

18 Using Sequence Analysis: Hints
- Prediction of 3D fold (if distant homologs have known structures!) and of general biochemical function is much easier than prediction of exact biological function
- Sequence analysis complements structural comparisons and can greatly benefit from them
- Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise
Credit: Dr. M. Galperin, NCBI

19 Structural Genomics: Structure-Based Functional Predictions
- Protein Structure Initiative: determine 3D structures of all protein families
- Methanococcus jannaschii MJ0577 (hypothetical protein)
- Contains bound ATP => ATPase or ATP-mediated molecular switch
- Confirmed by biochemical experiments

20 Crystal Structure is Not a Function! Credit: Dr. M. Galperin, NCBI

21 Functional Prediction: II. Computational Analysis Beyond Homology
- Phylogenetic distribution (comparative genomics):
  - Wide: most likely essential
  - Narrow: probably clade-specific
  - Patchy: most intriguing
- Domain association: “Rosetta Stone”
- Genome context (gene neighborhood, operon organization)
Clues: specific to niche, pathway type
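The wide / narrow / patchy distinction can be made concrete with a presence/absence profile across genomes. The slides give no thresholds, so the rules below are assumptions made for illustration only: "wide" means present in at least 90% of the genomes, "narrow" means confined to a single clade, and everything else is "patchy".

```python
def classify_distribution(presence: dict[str, list[bool]],
                          wide_fraction: float = 0.9) -> str:
    """Classify a phyletic profile as wide, narrow or patchy.

    presence maps a clade name to presence/absence flags, one per genome.
    wide_fraction is an assumed cutoff, not a published value.
    """
    flags = [flag for genomes in presence.values() for flag in genomes]
    if not any(flags):
        return "absent"
    if sum(flags) / len(flags) >= wide_fraction:
        return "wide"
    clades_with_hits = [clade for clade, genomes in presence.items()
                        if any(genomes)]
    if len(clades_with_hits) == 1:
        return "narrow"
    return "patchy"


# Hypothetical profiles over six genomes in three clades.
wide = {"Bacteria": [True, True, True], "Archaea": [True, True], "Eukarya": [True]}
narrow = {"Bacteria": [False, False, False], "Archaea": [True, True], "Eukarya": [False]}
patchy = {"Bacteria": [True, False, False], "Archaea": [False, True], "Eukarya": [False]}

for profile in (wide, narrow, patchy):
    print(classify_distribution(profile))
```

A real analysis would weight genomes by phylogenetic distance rather than counting them equally; the flat count is used here only to keep the sketch short.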

22 Using Genome Context for Functional Prediction Embden-Meyerhof and Gluconeogenesis pathway: 6-phosphofructokinase (EC ) SEED analysis tool (by FIG)

23 Functional Prediction: Problem Areas
- Identification of protein-coding regions
- Delineation of potential function(s) for distant paralogs
- Identification of domains in the absence of close homologs
- Analysis of proteins with low sequence complexity

24 What to do with a new protein sequence
Basic:
- Domain analysis (SMART = most sensitive; Pfam = most complete; CDD)
- BLAST
- Curated protein family databases (PIRSF, InterPro, COGs)
- Literature (PubMed), following links from individual entries in the BLAST output (look for Swiss-Prot entries first)
If not sufficient:
- PSI-BLAST
- Refined PubMed search using gene/protein names, synonyms, function and other terms you found
- Genome neighborhood (prokaryotes)
Advanced:
- Multiple sequence alignments (manual)
- Structure-guided alignments and structure analysis
- Phylogenetic tree reconstruction

25 Case Study: Prediction Verified: GGDEF domain
Proteins containing this domain:
- Caulobacter crescentus PleD controls swarmer cell - stalk cell transition (Hecht and Newton, 1995)
- In Rhizobium leguminosarum and Acetobacter xylinum, required for cellulose biosynthesis (regulation)
Evidence:
- Predicted to be involved in signal transduction because it is found in fusions with other signaling domains (receiver, etc.)
- In Acetobacter xylinum, cyclic di-GMP is a specific nucleotide regulator of cellulose synthase (signalling molecule). A multidomain protein with a GGDEF domain was shown to have diguanylate cyclase activity (Tal et al., 1998)
- Detailed sequence analysis tentatively predicts GGDEF to be a diguanylate cyclase domain (Pei and Grishin, 2001)
- Complementation experiments prove diguanylate cyclase activity of GGDEF (Ausmees et al., 2001)

26 The Need for Classification
Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect
- Manual annotation of individual proteins is not efficient
Solution:
- Highly curated and annotated protein classification system
- Automatic annotation of sequences based on protein families
Facilitates:
- Automatic annotation of sequences based on protein families
- Systematic correction of annotation errors
- Protein name standardization
- Functional predictions for uncharacterized proteins
This all works only if the system is optimized for annotation

27 Levels of Protein Classification
Level | Example | Similarity | Evolution
Class | α/β | Structural elements | No relationships
Fold | TIM-Barrel | Topology of backbone | Possible monophyly
Domain Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
Orthologous group | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Traceable to a single gene in LCA
Lineage-specific expansion (LSE) | PA3131 and PA3181 | Paralogy within a lineage | Recent duplication

28 Protein Evolution
- Sequence changes: with enough similarity, one can trace back to a common origin
- Domain shuffling: what about these?
- Domain: evolutionary/functional/structural unit

29 Consequences of Domain Shuffling
[Figure: PIRSF families built from different combinations of the CM (AroQ type), PDH, PDT and ACT domains. Candidate names raised for their members: CM? PDH? PDT? CM/PDH? CM/PDT?]
CM = chorismate mutase; PDH = prephenate dehydrogenase; PDT = prephenate dehydratase; ACT = regulatory domain

30 Whole Protein = Sum of its Parts?
[Figure: PIRSF family with domains Peptidase M22, Acylphosphatase, ZnF, YrdC, ZnF]
On the basis of domain composition alone, biological function was predicted to be:
- RNA-binding translation factor
- maturation protease
Actual function:
- [NiFe]-hydrogenase maturation factor, carbamoyltransferase
Full-length protein functional annotation is best done using annotated full-length protein families

31 Practical classification of proteins: setting realistic goals
- We strive to reconstruct the natural classification of proteins to the fullest possible extent
- BUT domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity)
- THUS the further we extend the classification, the finer is the domain structure we need to consider
- SO we need to compromise between the depth of analysis and protein integrity
- OR …
Credit: Dr. Y. Wolf, NCBI

32 Complementary Approaches
Domain Classification:
- Allows a hierarchy that can trace evolution to the deepest possible level, the last point of traceable homology and common origin
- Can usually annotate only general biochemical function
Full-length Protein Classification:
- Cannot build a hierarchy deep along the evolutionary tree because of domain shuffling
- Can usually annotate specific biological function (preferred to annotate individual proteins)
- Can map domains onto proteins
- Can classify proteins even when domains are not defined

34 Protein Classification Databases
- Full-length protein classification: PIRSF
- Domain classification: Pfam, SMART, CDD
- Mixed: TIGRFAMs, COGs
- Based on structural fold: SCOP
- InterPro: integrates various types of classification databases

35 InterPro
- Integrated resource for protein families, domains and sites
- Combines a number of databases: PROSITE, PRINTS, Pfam, SMART, ProDom, TIGRFAMs, PIRSF
[Figure: InterPro view of a bifunctional chorismate mutase/prephenate dehydratase with CM, PDT and ACT domains]

36 The Ideal System…
- Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence
- Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology
- Allows for simultaneous use of the full-length protein and domain information (domains mapped onto proteins)
- Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families
- Expertly curated membership, family name, function, background, etc.
- Evidence attribution (experimental vs predicted)

37 PIRSF Classification System
PIRSF:
- Reflects evolutionary relationships of full-length proteins
- A network structure from superfamilies to subfamilies
Definitions:
- Homeomorphic Family: basic unit
- Homologous: common ancestry, inferred by sequence similarity
- Homeomorphic: full-length similarity and common domain architecture
- Hierarchy: flexible number of levels with varying degrees of sequence conservation
- Network Structure: allows multiple parents
Advantages:
- Annotate both general biochemical and specific biological functions
- Accurate propagation of annotation and development of standardized protein nomenclature and ontology

38 PIRSF Classification System A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
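The constraints on this slide (one homeomorphic family per protein, any number of parents and children per node) describe a directed acyclic graph rather than a tree. A minimal sketch of such a structure follows; the class names, method names and node labels are invented for illustration and are not part of any PIR software.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A family, subfamily or domain superfamily in the network."""
    name: str
    parents: set[str] = field(default_factory=set)
    children: set[str] = field(default_factory=set)


class FamilyNetwork:
    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}
        self.family_of: dict[str, str] = {}  # protein -> homeomorphic family

    def add_node(self, name: str) -> None:
        self.nodes.setdefault(name, Node(name))

    def link(self, parent: str, child: str) -> None:
        """Record a parent/child edge; a node may have several parents."""
        self.add_node(parent)
        self.add_node(child)
        self.nodes[parent].children.add(child)
        self.nodes[child].parents.add(parent)

    def assign(self, protein: str, family: str) -> None:
        """Place a protein in exactly one homeomorphic family."""
        current = self.family_of.get(protein)
        if current is not None and current != family:
            raise ValueError(f"{protein} already belongs to {current}")
        self.add_node(family)
        self.family_of[protein] = family


# Hypothetical example: a three-domain family has three superfamily parents.
net = FamilyNetwork()
for superfamily in ("CM superfamily", "PDT superfamily", "ACT superfamily"):
    net.link(superfamily, "CM/PDT family")
net.assign("protein_1", "CM/PDT family")
print(sorted(net.nodes["CM/PDT family"].parents))
```

Keeping the protein-to-family map separate from the parent/child edges makes the "only one homeomorphic family" rule easy to enforce while still allowing multiple parents per family.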

39 Creation and Curation of PIRSFs
[Figure: curation workflow]
- Automatic procedure: UniProtKB proteins -> automatic clustering -> preliminary homeomorphic families and orphans (computer-generated, uncurated clusters); automatic placement of new and unassigned proteins
- Computer-assisted manual curation: merge/split clusters; add/remove members; name, refs, description; create hierarchies (superfamilies/subfamilies); map domains on families; build and test HMMs; protein name rule/site rule
- Stages: preliminary homeomorphic families -> curated homeomorphic families -> final homeomorphic families
- Preliminary Curation (4,700 PIRSFs): membership, signature domains
- Full Curation (3,300 PIRSFs): family name, description, bibliography, PIRSF name rules

40 PIRSF Family Report: Curated Protein Family Information
- Taxonomic distribution of a PIRSF can be used to infer the evolutionary history of the proteins in the PIRSF
- Phylogenetic tree and alignment view allows further sequence analysis

41 PIRSF Hierarchy and Network: DAG Viewer

42 PIRSF Family Report (II)
- Integrated value-added information from other databases
- Mapping to other protein classification databases

43 PIRSF Protein Classification: Platform for Protein Analysis and Annotation
- Matching a protein sequence to a curated protein family rather than searching against a protein database
- Improves automatic annotation quality
- Serves as a protein analysis platform for a broad range of users
- Provides value-added information by expert curators, e.g., annotation of uncharacterized hypothetical proteins (functional predictions)

44 Family-Driven Protein Annotation
Objective: optimize for protein annotation
PIRSF Classification Name:
- Reflects the function when possible
- Indicates the maximum specificity that still describes the entire group
- Standardized format
- Name tags: validated, tentative, predicted, functionally heterogeneous
Hierarchy:
- Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)
Name Rules:
- Define conditions under which names propagate to individual proteins
- Enable further specificity based on taxonomy or motifs
- Names adhere to Swiss-Prot conventions (though we may make suggestions for improvement)
Site Rules:
- Define conditions under which features propagate to individual proteins

45 PIR Name Rules
Account for functional variations within one PIRSF, including:
- Lack of active site residues necessary for enzymatic activity
- Certain activities relevant only to one part of the taxonomic tree
- Evolutionarily-related proteins whose biochemical activities are known to differ
Monitor such variables to ensure accurate propagation
Propagate other properties that describe function: EC, GO terms, misnomer info, pathway
Name Rule types:
- “Zero” Rule: default rule (only condition is membership in the appropriate family); information is suitable for every member
- “Higher-Order” Rule: has requirements in addition to membership; can have multiple rules that may or may not have mutually exclusive conditions

46 Example Name Rules
Rule ID | Rule Conditions | Propagated Information
PIRNR | PIRSF member and vertebrates | Name: S-acyl fatty acid synthase thioesterase
PIRNR | PIRSF member and not vertebrates | Name: Type II thioesterase
PIRNR | PIRSF member | Name: ACT domain protein; Misnomer: chorismate mutase
Note the lack of a zero rule for PIRSF000881

47 Name Rule Propagation Pipeline
Affiliation of sequence: homeomorphic family or subfamily (whichever PIRSF is the lowest possible node)
- Name rule exists?
  - No: nothing to propagate
  - Yes: go to the next check
- Protein fits criteria for any higher-order rule?
  - Yes: assign name from Name Rule 1 (or 2 etc.)
  - No: go to the next check
- PIRSF has zero rule?
  - Yes: assign name from Name Rule 0
  - No: nothing to propagate
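The decision flow above can be sketched in a few lines. This is an illustrative reading of the flowchart, not PIR's implementation: the `NameRule` class, the `lineage` key and the taxon string "Vertebrata" are assumptions, while the rule names are taken from the example name rules in the slides.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NameRule:
    name: str
    # A higher-order rule carries an extra condition; None marks a zero rule,
    # whose only requirement is family membership.
    condition: Optional[Callable[[dict], bool]] = None


def propagate_name(protein: dict, rules: list[NameRule]) -> Optional[str]:
    """Return the name to propagate, or None if there is nothing to propagate."""
    if not rules:                      # no name rule exists for this PIRSF
        return None
    for rule in rules:                 # higher-order rules take precedence
        if rule.condition is not None and rule.condition(protein):
            return rule.name
    for rule in rules:                 # fall back to the zero rule, if any
        if rule.condition is None:
            return rule.name
    return None


# A family with two higher-order rules and no zero rule.
thioesterase_rules = [
    NameRule("S-acyl fatty acid synthase thioesterase",
             lambda p: "Vertebrata" in p["lineage"]),
    NameRule("Type II thioesterase",
             lambda p: "Vertebrata" not in p["lineage"]),
]
# A family with only a zero rule.
act_rules = [NameRule("ACT domain protein")]

print(propagate_name({"lineage": ["Eukaryota", "Vertebrata"]}, thioesterase_rules))
print(propagate_name({"lineage": ["Bacteria"]}, thioesterase_rules))
print(propagate_name({"lineage": ["Bacteria"]}, act_rules))
```

The function assumes the protein has already been placed in the lowest possible PIRSF node, so `rules` is the rule set of that node.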

48 Name Rule in Action at UniProt
Current:
- Automatic annotations (AA) are in a separate field
- AA only visible from
Future:
- Automatic name annotations will become DE line if DE line will improve as a result
- AA will be visible from all consortium-hosted web sites

49 PIR Site Rules
Position-Specific Site Features:
- active sites
- binding sites
- modified amino acids
Current requirements:
- at least one PDB structure
- experimental data on functional sites
Rule Definition:
- Select template structure
- Align PIRSF seed members with structural template
- Edit alignment to retain conserved regions covering all site residues
- Build Site HMM from concatenated conserved regions

50 Match Rule Conditions
Only propagate site annotation if all rule conditions are met:
- Membership Check (PIRSF HMM threshold): ensures that the annotation is appropriate
- Conserved Region Check (site HMM threshold)
- Residue Check (all position-specific residues in HMMAlign)
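The three checks combine as a simple conjunction. In the sketch below, the scores are assumed to be "higher is better" values already obtained from HMM searches, and the residue map is assumed to come from an alignment of the protein to the site HMM; the class, field names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteRule:
    family_threshold: float            # minimum score against the PIRSF HMM
    site_threshold: float              # minimum score against the site HMM
    required_residues: dict[int, str]  # alignment column -> expected residue


def site_annotation_applies(rule: SiteRule,
                            family_score: float,
                            site_score: float,
                            aligned_residues: dict[int, str]) -> bool:
    """Propagate the site annotation only if all three checks pass."""
    membership_ok = family_score >= rule.family_threshold
    region_ok = site_score >= rule.site_threshold
    residues_ok = all(aligned_residues.get(column) == residue
                      for column, residue in rule.required_residues.items())
    return membership_ok and region_ok and residues_ok


# Hypothetical rule: two catalytic residues at alignment columns 12 and 57.
rule = SiteRule(family_threshold=100.0, site_threshold=20.0,
                required_residues={12: "H", 57: "D"})

print(site_annotation_applies(rule, 250.0, 35.0, {12: "H", 57: "D"}))
# Same scores, but the residue at column 57 is substituted.
print(site_annotation_applies(rule, 250.0, 35.0, {12: "H", 57: "N"}))
```

The second call shows why the residue check matters: a protein can be a confident family member with a well-conserved region and still lack a residue needed for activity.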

51 Rule-based Annotation of Protein Entries
- Functional Site rule: tags active site, binding, other residue-specific information
- Functional Annotation rule: gives name, EC, other activity-specific information
- Functional variations within one PIRSF (family or subfamily): binding sites with different specificity
- Monitor such variables for accurate propagation
- Site Rules Feed Name Rules

52 Overview
Problem:
- Most new protein sequences come from genome sequencing projects
- Many have unknown functions
- Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect
Functional Analysis of Protein Sequences:
- Homology-based (sequence analysis, structure analysis)
- Non-homology (genome context, phylogenetic distribution)
Solution for Large-scale Annotation:
- Highly curated and annotated protein classification system
- Automatic annotation of sequences based on protein families
Facilitates:
- Automatic annotation of sequences based on protein families
- Systematic correction of annotation errors
- Name standardization in UniProt
- Functional predictions for uncharacterized proteins

53 Impact of Protein Bioinformatics and Genomics
Single protein level:
- Discovery of new enzymes and superfamilies
- Prediction of active sites and 3D structures
Pathway level:
- Identification of “missing” enzymes
- Prediction of alternative enzyme forms
- Identification of potential drug targets
Cellular metabolism level:
- Multisubunit protein systems
- Membrane energy transducers
- Cellular signaling systems

54 PIR Team
Dr. Cathy Wu, Director
Protein Science team: Dr. Darren Natale (lead); Dr. Peter McGarvey; Dr. Cecilia Arighi; Dr. Anastasia Nikolskaya; Dr. Winona Barker; Dr. Sona Vasudevan; Dr. Zhang-zhi Hu; Dr. CR Vinayaka; Dr. Raja Mazumder; Dr. Lai-Su Yeh
Bioinformatics team: Dr. Hongzhan Huang (lead); Yongxing Chen, M.S.; Dr. Leslie Arminski; Baris Suzek, M.S.; Dr. Hsing-Kuo Hua; Xin Yuan, M.S.; Dr. Robel Kahsay; Jian Zhang, M.S.
Students: Natalia Petrova
UniProt Collaborators: Dr. Rolf Apweiler (EBI); Dr. Amos Bairoch (SIB)
UniProt is supported by the National Institutes of Health, grant # 1 U01 HG