Understanding proteins: resources for identification and annotation.


The Gene Ontology: Annotating protein function, role and localization Contact: Jane Lomax Coordinator, GO Editorial Office EBI-EMBL

What is an ontology?

Example of a classification hierarchy: Collectibles & art → Stamps → UK (Great Britain) → Victoria → 1884 Great Britain 10s Scott ($11,999.99)

A definition: “A controlled representation of ideas, concepts or events in a given domain and the relationships between them.”

Why do we need ontologies?
- Help with data retrieval by allowing annotations to be grouped. Example (adapted from Barry Smith): with 20 annotations to brain, 15 to hindbrain and 10 to rhombomere, a query for ‘brain’ returns 20 results without an ontology and 45 with one.
- Make data (re-)usable through standards
- Common structure and terminology (controlled vocabulary)
- Avoid redundancies (single data source)
- Allow common tools, techniques, training, validation...
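The ‘brain’ query can be sketched in a few lines of Python. This is only an illustration: the counts are the ones on the slide, while the child-to-parent links (rhombomere within hindbrain, hindbrain within brain) and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Number of annotations made directly to each term (figures from the slide).
direct_counts = {"brain": 20, "hindbrain": 15, "rhombomere": 10}

# Child -> parent links, assumed here for illustration.
parent_of = {"rhombomere": "hindbrain", "hindbrain": "brain"}


def descendants(term, parent_of):
    """Return the term itself plus every term below it in the hierarchy."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children[current])
    return found


def query(term, use_ontology):
    """Count annotations for a term, optionally including its descendants."""
    terms = descendants(term, parent_of) if use_ontology else {term}
    return sum(direct_counts.get(t, 0) for t in terms)


without_ontology = query("brain", use_ontology=False)
with_ontology = query("brain", use_ontology=True)
print(without_ontology, with_ontology)
```

Without the hierarchy only the 20 direct annotations are found; with it, the annotations to hindbrain and rhombomere are grouped under brain as well.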

Gene ontology: what is the Gene Ontology?
- Organized, controlled vocabulary of terms that describe gene product characteristics
- Represents gene product properties, not gene products themselves
- Three branches (domains): cellular component, molecular function, biological process
- Species-independent (with taxonomic restrictions)
- Represents physiological processes
- Goes up to the level of the cell

How does GO work? The Gene Ontology is like a dictionary:
term: transcription initiation
definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.
id: GO:

GO tree and annotations: terms are linked by is_a and part_of relationships (Clark et al., 2005)

An annotation example: GO terms for Caspase 9

Which processes are up- or down-regulated? [Figure: attacked versus control samples over time, labelled with GO terms including puparial adhesion, molting cycle, hemocyanin, defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes, amino acid catabolism, lipid metabolism, peptidase activity and protein catabolism.] Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

QuickGO: browsing GO Term definition

QuickGO: browsing GO Term relationships (ancestors)

QuickGO: browsing GO Term relationships (children)

QuickGO: browsing GO Proteins annotated to term

Annotation and ontology files
Ontology files:
- Hold ontology terms and structure
- Species-independent
- You can get GO-slims
Annotation files:
- Hold list of terms and the proteins annotated with them
- You can get species-specific files or the whole annotation.
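As a rough sketch of how an annotation file can be read, the Python below groups proteins by the term they are annotated with. It assumes a GAF-like layout (tab-separated, comment lines starting with "!", protein identifier in column 2, GO term identifier in column 5); the sample rows, identifiers and the function name are made up for illustration and are not real annotations.

```python
import io
from collections import defaultdict

# Toy annotation lines, invented for this example.
SAMPLE = (
    "!gaf-version: 2.2\n"
    "DB\tPROT_A\tgeneA\t\tGO:0000001\tREF:1\tIEA\t\tP\n"
    "DB\tPROT_A\tgeneA\t\tGO:0000002\tREF:2\tIDA\t\tF\n"
    "DB\tPROT_B\tgeneB\t\tGO:0000001\tREF:3\tIEA\t\tP\n"
)


def proteins_by_term(handle):
    """Map each GO term identifier to the set of proteins annotated with it."""
    term_to_proteins = defaultdict(set)
    for line in handle:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        protein_id, go_id = fields[1], fields[4]
        term_to_proteins[go_id].add(protein_id)
    return dict(term_to_proteins)


annotations = proteins_by_term(io.StringIO(SAMPLE))
for go_id, proteins in sorted(annotations.items()):
    print(go_id, sorted(proteins))
```

For a real file, the same function could be given an open file handle instead of the in-memory sample.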

More about GO: EBI train online

Acknowledgements & questions Jane Lomax Coordinator, GO Editorial Office EBI-EMBL

UniProt: A repository of annotated protein sequences Contact: Duncan Legge UniProt Content Team EBI-EMBL

Background of UniProt
Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD. Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database.

We aim to provide…
- A high quality protein sequence database: a non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs. Sequence archiving is essential.
- Easy protein identification: stable identifiers and consistent nomenclature / controlled vocabularies
- Thorough protein annotation: detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources

The Two Sides of UniProtKB
- UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed); 1 entry per protein
- UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed); 1 entry per nucleotide submission

UniProtKB/Swiss-Prot Manually annotated UniProtKB/TrEMBL Computationally annotated
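In UniProtKB FASTA downloads the two sections can be told apart from the header line, which starts with `sp` for Swiss-Prot entries and `tr` for TrEMBL entries, followed by the accession and entry name separated by `|`. The sketch below relies on that convention; the function name is an assumption, and the second header is a made-up placeholder rather than a real entry.

```python
def classify_header(header):
    """Split a UniProtKB-style FASTA header into (status, accession, entry name)."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Expected shape: >db|ACCESSION|ENTRY_NAME free-text description
    database, accession, rest = header[1:].split("|", 2)
    entry_name = rest.split(" ", 1)[0]
    status = {
        "sp": "reviewed (Swiss-Prot)",
        "tr": "unreviewed (TrEMBL)",
    }.get(database, "unknown")
    return status, accession, entry_name


headers = [
    ">sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4",
    ">tr|X0X0X0|X0X0X0_EXAMPLE Placeholder entry invented for this example",
]
results = [classify_header(h) for h in headers]
for result in results:
    print(result)
```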

Data sources of UniProtKB [Figure: sources feeding UniProt/TrEMBL: VEGA (Sanger), WormBase, FlyBase, Sub/Peptide Data, PDB, Patent Data, Ensembl, ENA (EMBL) DNA database, mRNA Data]

Curation of a UniProt/SwissProt entry [Figure: a UniProt/TrEMBL entry becomes a UniProt/SwissProt entry through curation of sequence, sequence variants, nomenclature, sequence features, ontologies, literature, annotations and references]

UniProt Website

UniProt layout

Annotation comments: FUNCTION, SUBCELLULAR LOCATION, ALTERNATIVE PRODUCTS, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, INDUCTION, SIMILARITY, CATALYTIC ACTIVITY, COFACTOR, ENZYME REGULATION, BIOPHYSICOCHEMICAL PROPERTIES, PATHWAY, SUBUNIT, INTERACTION, PTM, RNA EDITING, MASS SPECTROMETRY, DOMAIN, POLYMORPHISM, DISRUPTION PHENOTYPE, ALLERGEN, DISEASE, TOXIC DOSE, BIOTECHNOLOGY, PHARMACEUTICAL, MISCELLANEOUS, CAUTION, SEQUENCE CAUTION, WEB RESOURCE

Controlled vocabularies used whenever possible
Evidence tags to show source


Proteomes in UniProt
- Complete proteomes: complete sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced.
- Reference proteomes: some complete proteomes have been selected as reference proteome sets. These cover the proteomes of well-studied model organisms and other proteomes of interest for biomedical research.

Obtaining Proteomes

Help / Feedback
- Stuck? Just ask: there is an active help and support team.
- Feedback: if you find something incorrect, outdated or missing, please tell us.

Find out more: EBI online courses

Acknowledgements & questions Duncan Legge UniProt Content Team EBI-EMBL

InterPro: An integrated protein sequence analysis resource Contact: Amaia Sangrador InterPro curation Team EBI-EMBL

What is InterPro? InterPro is a sequence analysis resource that combines predictive models (known as signatures) from different databases to provide functional analysis of protein sequences, classifying them into families and predicting domains and important sites.

The aim of InterPro

Protein annotation: a predictive approach
- This is the approach taken by protein signature databases
- Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment
- We can use these models to infer relationships with the characterised sequences from which the alignment was constructed
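The idea of modelling conserved positions can be illustrated with a toy example. The sketch below is not how any member database builds its signatures; it simply counts residue frequencies per alignment column and reports the fully conserved positions. The alignment, function names and threshold are all invented for illustration.

```python
from collections import Counter

# Toy multiple sequence alignment (equal-length, ungapped sequences).
alignment = [
    "ACDEH",
    "ACDQH",
    "ACEEH",
    "GCDEH",
]


def column_profile(alignment):
    """Return, for each column, a dict of residue -> relative frequency."""
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: n / len(column) for aa, n in counts.items()})
    return profile


def conserved_positions(profile, threshold=1.0):
    """Write the most common residue where it reaches the threshold, else 'x'."""
    symbols = []
    for column in profile:
        residue = max(column, key=column.get)
        symbols.append(residue if column[residue] >= threshold else "x")
    return "".join(symbols)


profile = column_profile(alignment)
strict = conserved_positions(profile)           # only invariant columns
relaxed = conserved_positions(profile, 0.75)    # columns with >= 75% agreement
print(strict, relaxed)
```

Lowering the threshold trades specificity for sensitivity, which is the kind of judgement a real signature has to encode far more carefully.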

Three (4) different protein signature approaches
- Single motif methods: patterns
- Multiple motif methods: fingerprints
- Full alignment methods: profiles & hidden Markov models (HMMs)
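Of these approaches, patterns are the simplest to demonstrate, because a pattern is close to a regular expression. The sketch below translates a PROSITE-style pattern into a Python regular expression, covering only the basic syntax: `x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` / `>` for sequence ends. The toy pattern and sequences are invented, and the function is an illustration rather than a full parser.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # "<" anchors the match at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # ">" anchors the match at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        repeat = ""
        if low and high:
            repeat = "{%s,%s}" % (low, high)
        elif low:
            repeat = "{%s}" % low
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


TOY_PATTERN = "C-x(2,4)-C-{P}-[ST]"
regex = prosite_to_regex(TOY_PATTERN)
hit = re.search(regex, "AACGGCASTT")
miss = re.search(regex, "AACGGCPSTT")
print(regex, hit.group() if hit else None, miss)
```

A single pattern either matches or it does not, which is why the multiple-motif and full-alignment methods are used when more tolerance is needed.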

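InterPro Consortium [Figure: member databases grouped by signature type (hidden Markov models, fingerprints, profiles, patterns) and by focus (structural domains, functional annotation of families/domains, protein features or sites); HAMAP is among the databases shown]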

Database | Basis | Institution | Built from | Focus | URL (truncated in source)
Pfam | HMM | Sanger Institute | Sequence alignment | Family & domain based on conserved sequence | -
Gene3D | HMM | UCL | Structure alignment | Structural domain | c.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial functional family classification | arch/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | -
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | www/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | r.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & profiles | SIB | Sequence alignment | Functional annotation | -
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | ap/
ProDom | Sequence clustering | PRABI: Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | m/current/html/home.php

InterPro signature integration process
- Signatures are provided by member databases
- They are scanned against the UniProt database to see which sequences they match
- Curators manually inspect the matches before integrating the signatures into InterPro
- Signatures representing the same entity are integrated together
- Relationships between entries are traced, where possible
- Curators add literature-referenced abstracts, cross-refs to other databases, and GO terms

Using InterPro: let’s find some information about T-cell surface antigen CD4 in InterPro. Search using the keyword: CD4

Results from the “CD4” key word search

Type Name Identifier Contributing signatures Description Go terms References Family-centered view

Search using human CD4 protein sequence Using InterPro

Type Name Identifier Domains Family Protein-centered view

Type Name Identifier Contributing signatures Description References Domain-centered view

Using InterPro with unknown sequences: InterProScan
Search with an unknown protein sequence. InterProScan is the software package that allows sequences to be scanned against InterPro's signatures.

InterPro entries and contributing signatures Unintegrated signatures (not reviewed)
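A scan result of this kind can be post-processed to separate signatures that belong to an InterPro entry from unintegrated ones. The sketch below assumes a tab-separated match table with the member database in column 4, the signature accession in column 5 and the InterPro entry in column 12, with "-" marking an unintegrated signature; check that layout against the output of the version you run. All identifiers and helper names are placeholders invented for the example.

```python
from collections import defaultdict


def make_row(protein, database, signature, start, stop, entry):
    """Build one toy match line in the assumed 13-column layout."""
    return "\t".join([
        protein, "md5", "100", database, signature, "signature description",
        str(start), str(stop), "0.0", "T", "date", entry, "entry description",
    ])


# Toy matches for one query sequence (placeholder identifiers, not real accessions).
ROWS = [
    make_row("QUERY1", "DB_ONE", "SIG_A", 10, 60, "ENTRY_1"),
    make_row("QUERY1", "DB_TWO", "SIG_B", 12, 58, "ENTRY_1"),
    make_row("QUERY1", "DB_TWO", "SIG_C", 70, 95, "-"),
]


def group_matches(lines):
    """Group signatures by InterPro entry and collect unintegrated signatures."""
    integrated = defaultdict(set)
    unintegrated = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        signature, entry = fields[4], fields[11]
        if entry == "-":
            unintegrated.add(signature)
        else:
            integrated[entry].add(signature)
    return dict(integrated), unintegrated


integrated, unintegrated = group_matches(ROWS)
print(integrated, unintegrated)
```

Unintegrated signatures have not been reviewed by curators, so matches to them deserve extra caution.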

InterPro usage
Within the EBI:
- Used by UniProtKB curators in their annotation of Swiss-Prot proteins
- Forms part of the automated system that adds annotation to UniProtKB/TrEMBL
- Provides matches to over 80% of UniProtKB
- Source of >60 million Gene Ontology (GO) mappings to >17 million distinct UniProtKB sequences
Outside the EBI:
- 50,000 unique visitors to the web site per month
- >2 million sequences searched online per month
- Plus offline searches with downloadable version

Remember! Probabilistic models != biological certainty
- We are using biologically-unaware search tools and probabilistic models
- Ask questions, weigh the evidence

Caveats
- We need your feedback: missing/additional references, reporting problems, requests
- Sheer amount of data can be overwhelming
- Member databases do not always agree!
- InterPro entries are based on signatures supplied to us by our member databases. This means no signature, no entry!

Find out more: EBI online courses

Acknowledgements & questions Amaia Sangrador InterPro curation team EBI-EMBL

PDBe: Protein Data Bank in Europe Contact: Gary Battle Project Leader Outreach PDBe

PDBe overview
Mission: Bringing Structure to Biology
Major activities:
- Deposition and annotation site for structural data on biomacromolecules (X-ray, NMR, EM)
- Integration of macromolecular structure data with important biological and chemical data resources
- Provide tools and services for accessing, exploiting and disseminating structural data to the wider biomedical community

Worldwide Protein Data Bank (wwPDB)

PDBeXplore Browse the PDB using familiar classification systems (enzymes, folds, families, compounds, taxonomy, sequence). Latest structures: pdbe.org/pdbexplore

PDBePISA Exploration of macromolecular (protein, DNA/RNA and ligand) interfaces and prediction of probable quaternary structures. Predict quaternary structure: pdbe.org/pisa

PDBeFold Interactive comparison, alignment and superposition based on protein secondary structure. Find similar structures: pdbe.org/fold

PDBeMotif Flexible 3D search and analysis of protein-ligand interactions, binding environments and structural motifs. Analyse binding sites and motifs: pdbe.org/motif

NMR resources and services Visualisation and validation of NMR models and data. NMR resources: pdbe.org/nmr

EM resources and services Comprehensive search and analysis tools for EMDB entries. EM resources: pdbe.org/em

Electron Microscopy Data Bank (EMDB)
- Global public repository for EM density maps of macromolecular complexes and subcellular structures
- Founded at EBI in 2002
- Jointly operated by PDBe, RCSB and NCMI
- PDBe EM portal provides advanced search, visualisation and analysis services.

Educational resources: Quips Interactive exploration of interesting structures from the PDB Quite interesting PDB structures: pdbe.org/quips

Stay informed…

Find out more: EBI online courses

Acknowledgements & questions Gary Battle EBI-EMBL