Protein Bioinformatics – Advances and Challenges. By Sona Vasudevan and Peter McGarvey.

Presentation transcript:

1 Protein Bioinformatics – Advances and Challenges. By Sona Vasudevan and Peter McGarvey

2 Outline
• What is Bioinformatics? Past & Present
• About PIR
• PIR resources
• UniProt resources
• PIR's leading role in caBIG, Biodefense and Ontology

3 What is Bioinformatics?
NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000)
• Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computer (Information) + Mouse (Biology) = Bioinformatics

4 “A science which hesitates to forget its founders is lost.” ---- A. N. Whitehead

5 Dr. Margaret Oakley Dayhoff (1925–1983)
• The origin of the single-letter code for the amino acids
• Evolution of protein databases (Georgetown University)

6 Challenges we are facing today!
• Total number of sequences in NR: ~4,919,302
• Total number of environmental sequences (NCBI): ~6,028,191
• Number of domain families (Pfam): ~8957
• Number of domain families (SMART): ~665
• Number of structures (PDB): ~
• Number of COGs: ~4873 (unicellular), ~4852 (eukaryote)

7 Molecular Biology Databases
• 719 databases in 14 categories
• The DNA sequence database has exceeded 100 gigabases.

8 The birth of the “omes” and the “omic” era in biology

9 Genomics Proteomics Unknomics Functionomics Metagenomics

10

11 Protein Information Resource
• UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
• PIRSF Protein Family Classification System: Protein Classification and Functional Annotation
• iProClass Integrated Protein Knowledgebase: Data Integration and Functional Associative Analysis
Integrated Protein Informatics Resource for Proteomics Research

12 UniProt Databases
• UniParc: Comprehensive Sequence Archive with Sequence History
• UniProt: Knowledgebase with Full Classification and Functional Annotation
• UniRef: Non-redundant Reference Databases for Sequence Search
[Diagram labels: source data (Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, Ensembl, PDB, Patent Data, Other Data); UniParc (Archive); Merging; Classification, Literature-Based & Automated Annotation; UniProt (Knowledgebase); Clustering at 100, 90, 50% Identity; UniRef100 (NREF), UniRef90, UniRef50]
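The idea of clustering at 100, 90 and 50% identity can be illustrated with a toy sketch. This is not how UniRef is actually built: the position-wise `identity` measure (equal-length sequences only), the greedy assignment to the first matching representative, and the sequences themselves are all assumptions made for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences (toy measure)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: dict, threshold: float) -> dict:
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters = {}
    # Longest sequences first (ties broken by ID), so representatives are the longest members.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in clusters:
            if identity(seqs[rep], seqs[sid]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Hypothetical sequences, invented for this example.
toy = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQR",  # identical to A
    "C": "MKTAYIAKQK",  # 9 of 10 positions identical to A
    "D": "MKTGGGGGQR",  # 5 of 10 positions identical to A
    "E": "WWWWWWWWWW",  # unrelated
}

levels = {t: greedy_cluster(toy, t) for t in (1.0, 0.9, 0.5)}
for t, clusters in levels.items():
    print(f"{int(t * 100)}% identity: {clusters}")
```

Lowering the threshold merges more sequences under one representative, which is the sense in which the 90% and 50% sets are progressively smaller than the 100% set.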

13 UniProt Knowledgebase
• Objective: Stable, Comprehensive, Fully Classified, Richly and Accurately Annotated
• Information Content
  - Isoform Presentation
  - Nomenclature
  - Family Classification and Domain Identification
  - Functional Annotation
• Approaches
  - Full Classification
  - Automated Annotation
  - Literature-Based Curation
  - Database Cross-References
  - Controlled Vocabularies & Ontologies
  - Evidence Attribution

14 PIRSF Classification System
• PIRSF:
  - Reflects evolutionary relationships of full-length proteins
  - A network structure from superfamilies to subfamilies
• Definitions:
  - Homeomorphic Family (HF): Basic Unit
  - Homologous: Common ancestry, inferred by sequence similarity
  - Homeomorphic: Full-length similarity & common domain architecture
  - Hierarchy: Flexible number of levels with varying degrees of sequence conservation
  - Network Structure: Allows multiple parents
• Advantages:
  - Annotate both general biochemical and specific biological functions
  - Accurate propagation of annotation and development of standardized protein nomenclature and ontology
Credit: AN Nikolskaya

15 PIRSF Classification System
Protein Classification and Functional Annotation
• Comprehensive Classification of All UniProt Proteins
• Curated Families with Protein Name and Site Rules
• Classification and Visualization Tools: Taxonomy Distribution and Phylogenetic Pattern; Iterative BlastClust; Tree with Annotation Table, MSA & Phylogenetic tree

16 Classification Tool: BlastClust
• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
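Single-linkage clustering, as named on this slide, can be sketched with a union-find structure. This is a minimal illustration and not the BlastClust implementation: it assumes the BLAST hits have already been filtered by whatever identity and coverage cut-offs the curator chose, and the protein IDs and hit pairs are hypothetical.

```python
def single_linkage(ids, hits):
    """Group IDs into clusters where any pairwise hit links two clusters together."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical proteins and pre-filtered pairwise hits.
proteins = ["P1", "P2", "P3", "P4", "P5"]
hits = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]

clusters = single_linkage(proteins, hits)
print(clusters)
```

Note that P1 and P3 end up in the same cluster without a direct hit between them, because both hit P2. That chaining is the defining property of single linkage, and it is one reason curator guidance is useful when choosing thresholds.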

17 PIRSF-Based Protein Annotation
Classification-Driven Rule-Based Annotation: Provides Consistent Annotation and Database Integrity Check. Includes:
• Site Rule (PIRSR): Position-Specific Site Feature (FT)
• Name Rule (PIRNR): transfer name from PIRSF to individual proteins; Protein Name (DE) with Synonym, EC, Misnomer
• GO Term
Name Rule Interface (Rule ID | Rule Condition | Rule Description):
PIRNR | PIRSF member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC )
PIRNR | PIRSF member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC )
PIRNR | PIRSF member | Name: ACT domain protein; Misnomer: chorismate mutase
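The name rules on this slide pair a condition (family membership plus a taxonomic test) with a name to transfer. A minimal sketch of that idea follows. The rule IDs, the family ID `FAMILY_X`, the accessions, and the representation of taxonomy as a lineage tuple are all hypothetical; only the two protein names and the vertebrate / non-vertebrate split come from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Protein:
    accession: str
    family: str      # family the protein was classified into
    lineage: tuple   # taxonomic lineage, e.g. ("Eukaryota", "Vertebrata")


@dataclass(frozen=True)
class NameRule:
    rule_id: str
    family: str
    taxon_test: Callable[[tuple], bool]
    name: str


def apply_name_rules(protein: Protein, rules: list) -> Optional[str]:
    """Return the name from the first rule whose condition the protein satisfies."""
    for rule in rules:
        if protein.family == rule.family and rule.taxon_test(protein.lineage):
            return rule.name
    return None  # no rule fits, so no name is transferred


# Hypothetical rule IDs and family ID; the names follow the slide's example.
RULES = [
    NameRule("RULE-1", "FAMILY_X", lambda lin: "Vertebrata" in lin,
             "S-acyl fatty acid synthase thioesterase"),
    NameRule("RULE-2", "FAMILY_X", lambda lin: "Vertebrata" not in lin,
             "Type II thioesterase"),
]

human_like = Protein("ACC1", "FAMILY_X", ("Eukaryota", "Vertebrata"))
bacterial = Protein("ACC2", "FAMILY_X", ("Bacteria",))
unrelated = Protein("ACC3", "FAMILY_Y", ("Bacteria",))

for p in (human_like, bacterial, unrelated):
    print(p.accession, "->", apply_name_rules(p, RULES))
```

Keeping the condition separate from the name makes it possible to give members of one family different names in different lineages, which is what the vertebrate and non-vertebrate rows on the slide do.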

18 Rule-based Annotation of Protein Entries Using PIRSF
Structure; Binding/active sites; Identification of residues

19 Methodology
• Defining a Rule
  - Select template structure
  - Align curated PIRSF seed members and structural template
  - Structure-based sequence alignment of seeds
  - Edit MSA retaining conserved regions covering all site residues
  - Build Site HMM from concatenated conserved regions
• Rule Condition
  - Membership Check (PIRSF HMM threshold)
  - Conserved Region Check (site HMM threshold)
  - Site Residue Check (position-specific residue in HMMAlign)
• Rule Propagation
  - Propagate conserved feature annotation to all members that fit the rule
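The three rule-condition checks and the propagation step can be sketched as below. This is an illustration under stated assumptions, not PIR's pipeline: the HMM scores are taken as precomputed numbers, the alignment is represented as a mapping from template column to the residue found in the candidate protein, and all thresholds, columns, residues, and member IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRule:
    family_threshold: float  # minimum family HMM score (membership check)
    site_threshold: float    # minimum site HMM score (conserved region check)
    site_residues: dict      # template column -> set of allowed residues
    feature: str             # annotation to propagate when the rule fits


def rule_applies(rule, family_score, site_score, aligned):
    """Apply the three checks in order; every one must pass."""
    if family_score < rule.family_threshold:   # membership check
        return False
    if site_score < rule.site_threshold:       # conserved region check
        return False
    # Site residue check: each site column must carry an allowed residue.
    return all(aligned.get(col) in allowed
               for col, allowed in rule.site_residues.items())


def propagate(rule, members):
    """Return {member ID: feature} for every member that fits the rule."""
    return {mid: rule.feature
            for mid, (fam, site, aligned) in members.items()
            if rule_applies(rule, fam, site, aligned)}


# Hypothetical rule and candidate members.
rule = SiteRule(
    family_threshold=100.0,
    site_threshold=20.0,
    site_residues={12: frozenset("S"), 45: frozenset("H"), 78: frozenset("DE")},
    feature="Active site",
)
members = {
    "M1": (150.0, 35.0, {12: "S", 45: "H", 78: "D"}),  # passes all checks
    "M2": (150.0, 35.0, {12: "A", 45: "H", 78: "E"}),  # wrong residue at column 12
    "M3": (80.0, 35.0, {12: "S", 45: "H", 78: "D"}),   # fails membership check
    "M4": (150.0, 10.0, {12: "S", 45: "H", 78: "D"}),  # fails conserved region check
}

annotated = propagate(rule, members)
print(annotated)
```

Requiring all three checks means a protein that belongs to the family but has lost a catalytic residue, like M2 here, does not inherit the site annotation.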

20 An example of a PIR rule integrated into an SP record (figure label: PIR Rule)

21 PIRSF Protein Classification provides a platform for protein annotation
• Improves Annotation Quality
  - Annotation of biological function of whole proteins
  - Annotation of uncharacterized hypothetical proteins (functional predictions helped by newly detected family relationships)
  - Correction of annotation errors
  - Improvement of under- or over-annotated proteins
• Standardization of Protein Names

22 Data Integration
• Data Warehouse
  - Local Copy of Databases in a Unified Database Schema
  - Allows Local Control of Data; Update Problem
• Hypertext Navigation
  - Browsing Model with Hypertext Links
  - Allows Direct Interaction; Easily Lost in Cyberspace
• iProClass Approach
  - Data Warehouse + Hypertext Navigation
  - Rich Links (Links + Executive Summaries)
  - Modular and Open Framework for Adding New Components in Distributed Networking Environment

23 iProClass Database
• ~5,000,000 Protein Sequences
• Rich Links to >80 Databases
• Value-Added Views for UniProt
Integrated Protein Family, Function, Structure Information

24 iProClass Views: Sequence Report; Family Report

25 PIR iProClass Searches: Text Search; Peptide Search; BLAST Search; ID Mapping

26
1. Albert Einstein College of Medicine: T. gondii, C. parvum
2. Caprion Pharmaceuticals: B. abortus
3. Harvard Institute of Proteomics: V. cholerae, B. anthracis
4. Myriad Genetics: B. anthracis, Y. pestis, F. tularensis, Vaccinia, Variola
5. Pacific Northwest National Laboratory: S. typhimurium, S. typhi, Vaccinia, Monkeypox
6. Scripps: SARS CoV, Influenza
7. University of Michigan: B. anthracis
[Diagram labels: Scripps, Caprion, Myriad, Harvard, U of Michigan, Albert Einstein, PNNL; Resource Center (SSS, PIR, VBI); DATA]

27 Organism; Research Center; Data Type

28 Master Protein Directory
• Currently contains 3,733 ORF Clones out of 3,784 Proteins
• 29 Colonization Pathway Proteins

29 Protein Summary Report; Clone Sequences; Order Clones from Repositories; Protein and Reagent Information; Search for Related Proteins in Catalog by Family Classification or Similarity Searches

30 Mouse proteins detected in B. anthracis- and S. typhimurium-infected macrophages

31 NCI caBIG Initiative
cancer Biomedical Informatics Grid:
• Informatics platform to enable sharing of research, data and tools
• Designed and built by an open federation of organizations
• Facilitate connectivity via common standards and unifying architecture
• Open source and open access principles
Domain Workspaces: Clinical Trial Management Systems; Integrative Cancer Research; Imaging; Tissue Banks and Pathology Tools
Cross Cutting Workspaces: Architecture; Vocabularies and Common Data Elements

32 PIR Activities in caBIG™
• Integrative Cancer Research Workspace
  - Developer: Grid-enablement of PIR
  - Adopter: SEED Genome Annotation Tool (completed); GeneConnect Genomic Identifier Mapping Service
• Vocabularies and Common Data Elements
  - Participant

33