Understanding proteins: resources for identification and annotation
The Gene Ontology: Annotating protein function, role and localization Contact: Jane Lomax Coordinator, GO Editorial Office EBI-EMBL
What is an ontology?
→ Collectibles & art → Stamps → UK (Great Britain) Victoria → 1884 GREAT BRITAIN 10S SCOTT (11,999.99$). A definition: "A controlled representation of ideas, concepts or events in a given domain and the relationships between them."
Why do we need ontologies? Help with data retrieval: allow grouping of annotations. Example (adapted from Barry Smith): annotation counts are brain 20, hindbrain 15, rhombomere 10; a query for 'brain' returns 20 without the ontology and 45 with it. Make data (re-)usable through standards: common structure and terminology (controlled vocabulary); avoid redundancies (single data source); allow common tools, techniques, training, validation...
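The brain example above can be reproduced in a few lines. This is a minimal sketch, assuming rhombomere sits under hindbrain and hindbrain under brain (the slide gives the counts, not the exact links); the function names are illustrative.

```python
# Toy ontology: each term maps to its direct parent (None marks the root).
# The chain brain <- hindbrain <- rhombomere is an assumption for this sketch.
PARENT = {"brain": None, "hindbrain": "brain", "rhombomere": "hindbrain"}

# Number of annotations attached directly to each term (figures from the slide).
DIRECT_COUNTS = {"brain": 20, "hindbrain": 15, "rhombomere": 10}


def is_descendant_or_self(term, ancestor):
    """Walk up the parent links from `term` until `ancestor` or the root is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT[term]
    return False


def query_without_ontology(term):
    """Exact-match retrieval: only annotations made directly to `term`."""
    return DIRECT_COUNTS.get(term, 0)


def query_with_ontology(term):
    """Ontology-aware retrieval: annotations to `term` and to all its descendants."""
    return sum(
        count
        for annotated_term, count in DIRECT_COUNTS.items()
        if is_descendant_or_self(annotated_term, term)
    )


print(query_without_ontology("brain"))  # 20
print(query_with_ontology("brain"))     # 45
```

The ontology-aware query finds the hindbrain and rhombomere annotations because the hierarchy records that they are kinds or parts of brain; nothing in the annotations themselves had to change.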
Gene ontology. What is the gene ontology? An organized, controlled vocabulary of terms that describe gene product characteristics. Represents gene product properties, not gene products themselves. Three branches (domains): cellular component, molecular function, biological process. Species-independent (with taxonomic restrictions). Represents physiological processes. Goes up to the level of the cell.
How does GO work? The Gene Ontology is like a dictionary. term: transcription initiation. definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter. id: GO:
GO tree and annotations (figure: terms linked by is_a and part_of relationships; Clark et al., 2005)
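One way to picture how annotations interact with is_a and part_of links is to treat the ontology as a graph and let an annotation to a term count for every ancestor of that term. The sketch below is an assumption-laden toy: the four terms form a simplified graph written for this example (not copied from a GO release), and `protein_A` and `protein_B` are hypothetical.

```python
from collections import deque

# Toy graph: child term -> list of (relationship, parent term).
# Simplified for illustration; not taken from a real GO release.
EDGES = {
    "nucleolus": [("part_of", "nucleus")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
    "cellular component": [],
}

# Hypothetical direct annotations: protein -> set of terms.
ANNOTATIONS = {"protein_A": {"nucleolus"}, "protein_B": {"nucleus"}}


def ancestors(term):
    """Return every term reachable from `term` by following is_a or part_of links."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relationship, parent in EDGES.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def proteins_for(term):
    """Proteins annotated to `term` directly or to any of its descendants."""
    return sorted(
        protein
        for protein, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(proteins_for("nucleolus"))  # ['protein_A']
print(proteins_for("nucleus"))    # ['protein_A', 'protein_B']
```

A protein annotated to the more specific term is returned for the broader one as well, which is the behaviour a browser such as QuickGO relies on when listing proteins for a term.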
An annotation example: GO terms for Caspase 9
Which processes are up- or down-regulated? (Figure: attacked versus control samples over time. Labels shown: puparial adhesion, molting cycle, hemocyanin, defense response, immune response, response to stimulus, Toll-regulated genes, JAK-STAT-regulated genes, amino acid catabolism, lipid metabolism, peptidase activity, protein catabolism.) Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL, and Eugene Schuster Group, EBI.
QuickGO: browsing GO. Term definition
QuickGO: browsing GO. Term relationships (ancestors)
QuickGO: browsing GO. Term relationships (children)
QuickGO: browsing GO. Proteins annotated to term
Annotation and ontology files. Ontology files: hold ontology terms and structure; species-independent; GO slims are also available. Annotation files: hold the list of terms and the proteins annotated with them; available as species-specific files or as the whole annotation set.
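As a rough illustration of what an annotation file holds, the sketch below reads a few lines in the tab-separated GAF layout (17 columns, comment lines starting with `!`) and groups GO terms by protein. The accessions, gene symbols and GO identifiers are placeholders made up for this example, not real annotation records, and only the columns used here are filled in.

```python
from collections import defaultdict


def make_line(object_id, symbol, go_id, evidence, aspect):
    """Build a 17-column GAF-style line, leaving unused columns empty."""
    fields = [""] * 17
    fields[0] = "UniProtKB"  # column 1: database
    fields[1] = object_id    # column 2: database object ID
    fields[2] = symbol       # column 3: object symbol
    fields[4] = go_id        # column 5: GO ID
    fields[6] = evidence     # column 7: evidence code
    fields[8] = aspect       # column 9: aspect (P, F or C)
    return "\t".join(fields)


# Placeholder records for illustration only.
SAMPLE = "\n".join([
    "!gaf-version: 2.2",
    make_line("X00001", "GENE1", "GO:0000001", "IDA", "P"),
    make_line("X00001", "GENE1", "GO:0000002", "IEA", "C"),
    make_line("X00002", "GENE2", "GO:0000001", "IEA", "P"),
])


def parse_gaf(text):
    """Map each protein ID to a list of (GO ID, aspect, evidence code) tuples."""
    by_protein = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        by_protein[cols[1]].append((cols[4], cols[8], cols[6]))
    return dict(by_protein)


annotations = parse_gaf(SAMPLE)
for protein, terms in annotations.items():
    print(protein, terms)
```

The evidence code column is worth keeping alongside each term, since it distinguishes experimentally supported annotations from electronically inferred ones.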
More about GO: EBI train online
Acknowledgements & questions Jane Lomax Coordinator, GO Editorial Office EBI-EMBL
UniProt: A repository of annotated protein sequences Contact: Duncan Legge UniProt Content Team EBI-EMBL
Background of UniProt. Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD. Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database.
We Aim To Provide… A high quality protein sequence database: a non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs; sequence archiving is essential. Easy protein identification: stable identifiers and consistent nomenclature / controlled vocabularies. Thorough protein annotation: detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources.
The Two Sides of UniProtKB. UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed); 1 entry per protein. UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed); 1 entry per nucleotide submission.
UniProtKB/Swiss-Prot: manually annotated. UniProtKB/TrEMBL: computationally annotated.
Data sources of UniProtKB (diagram labels): UniProt/TrEMBL; VEGA (Sanger); WormBase; FlyBase; Sub/Peptide Data; PDB; Patent Data; Ensembl; ENA (EMBL) DNA database; mRNA Data
Curation of a UniProt/SwissProt entry (diagram labels): Sequence; Sequence variants; Nomenclature; Sequence features; UniProt/TrEMBL; UniProt/SwissProt; Ontologies; Literature; Annotations; References
UniProt Website
UniProt layout
Annotation comments: FUNCTION, SUBCELLULAR LOCATION, ALTERNATIVE PRODUCTS, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, INDUCTION, SIMILARITY, CATALYTIC ACTIVITY, COFACTOR, ENZYME REGULATION, BIOPHYSICOCHEMICAL PROPERTIES, PATHWAY, SUBUNIT, INTERACTION, PTM, RNA EDITING, MASS SPECTROMETRY, DOMAIN, POLYMORPHISM, DISRUPTION PHENOTYPE, ALLERGEN, DISEASE, TOXIC DOSE, BIOTECHNOLOGY, PHARMACEUTICAL, MISCELLANEOUS, CAUTION, SEQUENCE CAUTION, WEB RESOURCE
Controlled vocabularies are used whenever possible. Evidence tags show the source.
Proteomes in UniProt. Complete proteomes: complete sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced. Reference proteomes: some complete proteomes have been selected as reference proteome sets. These cover the proteomes of well-studied model organisms and other proteomes of interest for biomedical research.
Obtaining Proteomes
Help / Feedback Stuck? Just ask – active help and support team Feedback – if you find something incorrect, outdated, missing etc please tell us.
Find out more: EBI online courses
Acknowledgements & questions Duncan Legge UniProt Content Team EBI-EMBL
InterPro: An integrated protein sequence analysis resource Contact: Amaia Sangrador InterPro curation Team EBI-EMBL
What is InterPro? InterPro is a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites. It combines predictive models (known as signatures) from different databases to provide functional analysis of protein sequences.
The aim of InterPro
Protein annotation: a predictive approach. This is the approach taken by protein signature databases: model the pattern of conserved amino acids at specific positions within a multiple sequence alignment. We can use these models to infer relationships with the characterised sequences from which the alignment was constructed.
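The idea of modelling conserved positions in an alignment can be sketched with a simple position-specific scoring matrix. This is a deliberately simplified stand-in for the profiles and HMMs that signature databases build: the four aligned sequences are hypothetical, gaps are not handled, the background frequency is assumed uniform, and the pseudocount of 1 is an arbitrary choice.

```python
import math
from collections import Counter

# Hypothetical gap-free alignment of four short sequences.
ALIGNMENT = ["ACDE", "ACDE", "ACEE", "GCDE"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
PSEUDOCOUNT = 1.0                # avoids log(0) for residues never observed
BACKGROUND = 1 / len(ALPHABET)   # assumed uniform background frequency


def build_profile(alignment):
    """Return one dict per column mapping each residue to a log-odds score."""
    profile = []
    total = len(alignment) + PSEUDOCOUNT * len(ALPHABET)
    for position in range(len(alignment[0])):
        column = Counter(seq[position] for seq in alignment)
        profile.append({
            aa: math.log2(((column[aa] + PSEUDOCOUNT) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum the per-position log-odds scores for a sequence of profile length."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


profile = build_profile(ALIGNMENT)
print(score(profile, "ACDE"))  # positive: resembles the alignment
print(score(profile, "WWWW"))  # negative: unlike the alignment
```

A sequence that scores well against the model is inferred to be related to the characterised sequences behind the alignment, which is an inference to weigh rather than a certainty.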
Three (4) different protein signature approaches. Single motif methods: patterns. Multiple motif methods: fingerprints. Full alignment methods: profiles and hidden Markov models (HMMs).
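Of the approaches listed above, a single-motif pattern is the easiest to show in code. The sketch below converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. The pattern and sequences are hypothetical, and the converter covers only the basic syntax elements (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`, `<`, `>`).

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchor to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor to the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (start, end, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.end(), m.group()) for m in regex.finditer(sequence)]


PATTERN = "C-x(2)-[ST]-{P}-H"              # hypothetical motif
print(prosite_to_regex(PATTERN))            # C.{2}[ST][^P]H
print(find_matches(PATTERN, "AACGGSAHKK"))  # [(2, 8, 'CGGSAH')]
print(find_matches(PATTERN, "AACGGSPHKK"))  # []
```

A pattern either matches or it does not, which is why single-motif methods suit short, strictly conserved sites, while fingerprints, profiles and HMMs are used for more divergent families and domains.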
InterPro Consortium (diagram labels): structural domains; functional annotation of families/domains; protein features (sites); hidden Markov models; fingerprints; profiles; patterns; HAMAP
Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | Sanger Institute | Sequence alignment | Family & Domain based on conserved sequence | -
Gene3D | HMM | UCL | Structure alignment | Structural Domain | c.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial Functional Family Classification | arch/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | -
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | www/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | r.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation | -
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | ap/
ProDom | Sequence clustering | PRABI: Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | m/current/html/home.php
InterPro signature integration process. Signatures are provided by member databases. They are scanned against the UniProt database to see which sequences they match. Curators manually inspect the matches before integrating the signatures into InterPro. Signatures representing the same entity are integrated together. Relationships between entries are traced, where possible. Curators add literature-referenced abstracts, cross-refs to other databases, and GO terms.
Using InterPro. Let's find some information about T-cell surface antigen CD4 in InterPro. Search using the key word: CD4
Results from the “CD4” key word search
Family-centered view: Type, Name, Identifier, Contributing signatures, Description, GO terms, References
Using InterPro. Search using the human CD4 protein sequence
Protein-centered view: Type, Name, Identifier, Domains, Family
Domain-centered view: Type, Name, Identifier, Contributing signatures, Description, References
Using InterPro with unknown sequences: InterProScan. Search with an unknown protein sequence. InterProScan is the software package that allows sequences to be scanned against InterPro's signatures.
InterPro entries and contributing signatures Unintegrated signatures (not reviewed)
InterPro usage. Within the EBI: used by UniProtKB curators in their annotation of Swiss-Prot proteins; forms part of the automated system that adds annotation to UniProtKB/TrEMBL; provides matches to over 80% of UniProtKB; source of >60 million Gene Ontology (GO) mappings to >17 million distinct UniProtKB sequences. Outside the EBI: 50,000 unique visitors to the web site per month; >2 million sequences searched online per month; plus offline searches with the downloadable version.
Remember! Probabilistic models != biological certainty. We are using biologically-unaware search tools and probabilistic models. Ask questions, weigh the evidence.
Caveats. We need your feedback: missing/additional references, reporting problems, requests. The sheer amount of data can be overwhelming. Member databases do not always agree! InterPro entries are based on signatures supplied to us by our member databases; this means no signature, no entry!
Find out more: EBI online courses
Acknowledgements & questions Amaia Sangrador InterPro curation team EBI-EMBL
PDBe: Protein Data Bank in Europe Contact: Gary Battle Project Leader Outreach PDBe
PDBe overview. Mission: Bringing Structure to Biology. Major activities: deposition and annotation site for structural data on biomacromolecules (X-ray, NMR, EM); integration of macromolecular structure data with important biological and chemical data resources; tools and services for accessing, exploiting and disseminating structural data to the wider biomedical community.
Worldwide Protein Data Bank (wwPDB)
PDBeXplore Browse the PDB using familiar classification systems (enzymes, folds, families, compounds, taxonomy, sequence). Latest structures: pdbe.org/pdbexplore
PDBePISA Exploration of macromolecular (protein, DNA/RNA and ligand) interfaces and prediction of probable quaternary structures. Predict quaternary structure: pdbe.org/pisa
PDBeFold Interactive comparison, alignment and superposition based on protein secondary structure. Find similar structures: pdbe.org/fold
PDBeMotif Flexible 3D search and analysis of protein-ligand interactions, binding environments and structural motifs. Analyse binding sites and motifs: pdbe.org/motif
NMR resources and services Visualisation and validation of NMR models and data. NMR resources: pdbe.org/nmr
EM resources and services Comprehensive search and analysis tools for EMDB entries. EM resources: pdbe.org/em
Electron Microscopy Data Bank (EMDB). Global public repository for EM density maps of macromolecular complexes and subcellular structures. Founded at EBI in 2002. Jointly operated by PDBe, RCSB and NCMI. The PDBe EM portal provides advanced search, visualisation and analysis services.
Educational resources: Quips Interactive exploration of interesting structures from the PDB Quite interesting PDB structures: pdbe.org/quips
Stay informed…
Find out more: EBI online courses
Acknowledgements & questions Gary Battle EBI-EMBL