Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Protein Ontology In-Silico Analysis.


1 Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Protein Ontology. In-Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot, Fortaleza, Brazil, August 4, 2006. Cathy H. Wu, Ph.D., Director, Protein Information Resource; Professor, Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center.

2 2 Wu CH, Zhao S, Chen HL. (1996) A protein class database organized with PROSITE protein groups and PIR superfamilies. Journal of Computational Biology, 3 (4), 547-562.

3 3 Protein Information Resource (PIR): Integrated Protein Informatics Resource for Genomic/Proteomic Research. UniProt Universal Protein Resource: central resource of protein sequence and function. PIRSF Family Classification System: protein classification and functional annotation. iProClass Integrated Protein Database: data integration and protein mapping. iProLINK Literature Mining Resource: annotation extraction. Other projects: NIAID Proteomics, caBIG Grid-Enablement. http://pir.georgetown.edu

4 4 PIR Protein Sequence Database. The PIR-International Protein Sequence Database (PIR-PSD) grew out of the Atlas of Protein Sequence and Structure (1965-1978), Vol 1-5, Suppl 1-3. Margaret Dayhoff collected all the known protein sequences to study protein evolution. The first Atlas contained 65 proteins; the final volume had 1,081 proteins. The PIR-PSD was produced from 1984 (Release 1, 2,900 proteins) to 2004 (Release 80, 283,416 proteins). PIR-PSD has been integrated into UniProt since 2002.

5 5 UniProt Activities at PIR. Integration of PIR-PSD into UniProtKB: incorporation of unique PIR entries; incorporation of PIR annotations (references, experimental features with literature evidence tags). Functional annotation of UniProtKB proteins: development of the PIRSF family classification system and PIRSF curation => comprehensive coverage of all UniProtKB proteins; development of the rule-based annotation system and PIRNR (name rule)/PIRSR (site rule) curation => rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines and propagation of annotations (e.g., name, GO, site feature). Production of UniRef100/90/50 databases. Creation of the UniProt web site and help system => unified UniProt web site and user community interaction.

6 6 PIRSF Classification System: Protein Classification and Functional Annotation. PIRSF: evolutionary relationships of proteins from super- to sub-families; curated families with name rules and site rules; curation platform with classification/visualization tools. Dissemination: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform.

7 7 iProClass Integrated Protein Database: Data Integration and Protein Mapping. Data integration from >90 databases; underlying data warehouse for protein ID/name/bibliography mapping and pre-computed BLAST results; integration of protein family, function, and structure for functional annotation; rich links (link + summary) for value-added reports of UniProt proteins.

8 8 iProLINK Text Mining Resource: Annotation Extraction and Literature-Based Protein Annotation. Curated datasets and literature corpus for development of literature mining and annotation extraction tools; RLIMS-P text-mining tool for extracting protein phosphorylation data; BioThesaurus of gene/protein names to resolve synonyms and ambiguity.

9 9 NIAID Biodefense Proteomic Program. Goals: characterize proteomes of pathogens and host cells; identify proteins associated with the biology of the microbes; elucidate mechanisms of microbial pathogenesis; understand immune responses and non-immune-mediated host responses. [Figure labels: Adm Ctr, PRC, Data Type, Organism]

10 10 Multiple Data Types from Proteomics Research Centers; Data Integration at NIAID Admin Center. Integrated Data at VBI: Data Exchange Format, Controlled Vocabulary, Ontology. Master Protein Directory & Complete Proteomes at GU-PIR: iProClass, UniProt, PIRSF, Protein ID, Peptide/Protein Sequence Mapping. Rich annotation: capture experimental data and scientific conclusions; integrate with major databases. http://pir.georgetown.edu/proteomics/

11 11 NCI caBIG Initiative. caBIG (cancer Biomedical Informatics Grid): cancer research platform to enable sharing of research infrastructure, data, and tools; designed and built by an open federation of organizations; based on common standards and open source/open access principles. One of four caBIG grid reference projects. PIR Grid-Enablement: UniProtKB as central protein information resource for cancer research. caBIG Workspaces: Integrative Cancer Research (PIR Developer Project: Grid Enablement of PIR; PIR Adopter Project: SEED Genome Annotation; PIR Adopter Project: GeneConnect ID mapping); Vocabularies and Common Data Elements (PIR Participant Project: protein models, objects, vocabularies, ontologies); caGrid Architecture.

12 12 UniProt Knowledgebase: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function. Family Classification-Driven and Rule-Based Curation: functional inference of uncharacterized hypothetical proteins; systematic detection and correction of genome annotation errors; improvement of under- or over-annotated proteins. Text Mining-Assisted and Literature-Based Curation: annotation extraction from scientific literature; attribution of experimental evidence. Ontology and Controlled Vocabulary-Based Curation: standardization of protein/gene/family names and annotation terms; annotation of specific protein entities.

13 13 PIR Superfamily Classification Tree of Life and Evolution of Protein Families (Dayhoff) The protein superfamily concept (1976) was based on sequence similarity, where sequences were categorized into superfamilies, families, subfamilies, and entries using different % identity thresholds.
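The tiered idea on this slide can be sketched in a few lines of Python. This is only a minimal illustration: the slide does not give the actual cutoffs, so the percentages in `LEVEL_THRESHOLDS` are assumed placeholder values, and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# ASSUMED cutoffs for illustration only (highest first); the slide does not state them.
LEVEL_THRESHOLDS = [(95.0, "entry"), (80.0, "subfamily"), (50.0, "family")]


def closest_shared_level(identity: float) -> str:
    """Return the most specific level whose cutoff the pair reaches."""
    for cutoff, level in LEVEL_THRESHOLDS:
        if identity >= cutoff:
            return level
    # Below every cutoff: at most a superfamily relationship, to be shown by other evidence.
    return "superfamily_or_below"
```

For example, two aligned fragments that agree at three of four ungapped positions score 75% and, under the assumed cutoffs, would share a family but not a subfamily.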

14 14 PIRSF Classification System. A network classification system from superfamily to subfamily levels that reflects the evolutionary relationships of full-length proteins and domains. The basic unit is the homeomorphic family: full-length similarity and common domain architecture. Provides annotation of generic biochemical and specific biological functions. Basis for evolutionary and comparative genomics research; for accurate and consistent automated protein annotation (protein name, biochemical and biological functions, functional sites); and for standardization of protein names and development of an ontology for protein evolution.

16 16 PIRSF Classification/Curation Workflow: 1. Computational generation of homeomorphic clusters. 2. Computational domain mapping and annotation of preliminary clusters. 3. Automatic placement of new proteins into families. 4. Computer-assisted expert analysis to define homeomorphic families. 5. Family hierarchy created as needed. 6. Expert annotation. 7. Name rules and optional site rules created. 8. Seed members to generate family HMMs.

17 17 PIRSF Classification Tools Iterative BlastClust Tree with Annotation Table Multiple Alignment and Phylogenetic Tree PIRSF Classification in DAG Editor ISMB: PIRSF Protein Classification System Demo

18 18 PIRSF Analysis/Visualization Tools Taxonomy Distribution and Phylogenetic Pattern Domain Display Family Hierarchy (DAG Browser)

19 19 PIRSF Family Report Curated family name Description of family Sequence analysis tools

20 20 Classification and Functional Annotation. Example: Phosphofructokinase (PFK) classification shows that functional specialization can result not only from major sequence changes but also from mutation of a single amino-acid residue. Key residues, E. coli (P06998): Gly105, Gly125. ATP-PFK: Gly105 + Gly125; PPi-PFK: Gly/Asp105 + Lys125. Families in the classification tree: ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274.
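The residue pattern on this slide (ATP-PFK: Gly105 + Gly125; PPi-PFK: Gly/Asp105 + Lys125) reads naturally as a small decision rule. The Python sketch below encodes only that pattern. It assumes the caller has already aligned the query to E. coli PFK (P06998) and extracted the residues at the columns matching positions 105 and 125; the function name `classify_pfk` is hypothetical and is not part of any PIRSF tool.

```python
def classify_pfk(residue_105: str, residue_125: str) -> str:
    """Classify a PFK by the two residues aligned to E. coli positions 105 and 125.

    Residues are one-letter codes: G = Gly, D = Asp, K = Lys.
    """
    r105 = residue_105.upper()
    r125 = residue_125.upper()
    if r105 == "G" and r125 == "G":
        return "ATP-PFK"   # Gly105 + Gly125
    if r105 in ("G", "D") and r125 == "K":
        return "PPi-PFK"   # Gly/Asp105 + Lys125
    return "unclassified"  # pattern not covered by the slide
```

With Gly at 105 in both cases, the call depends on position 125 alone, which is the single-residue specialization the slide highlights.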

21 21 Family-Based Rules for Annotation. Functional Site Rule: tags active site, binding, and other residue-specific information. Functional Name Rule: gives name, EC, GO, and other function-specific information.

22 22 iProLINK Literature Mining Resource

23 23 iProLINK Literature Mining Resource: 1. UniProtKB bibliography mapping in iProClass. 2. RLIMS-P rule-based NLP method for extracting protein phosphorylation data. 3. Substring-based machine learning method for PTM text categorization. 4. BioThesaurus of protein/gene names with UniProtKB association. 5. Named-entity tagging guide.
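To give a feel for what rule-based extraction of phosphorylation data means, here is a toy Python sketch built on a single regular expression. It is not RLIMS-P and does not reproduce its rules; the pattern, the function name, and the example sentences are assumptions chosen for illustration.

```python
import re

# Toy pattern (an assumption for illustration, not an RLIMS-P rule):
# "<kinase> phosphorylates <substrate> [at|on <Ser|Thr|Tyr><position>]"
PHOSPHO_PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylates\s+"
    r"(?P<substrate>[A-Za-z0-9-]+)"
    r"(?:\s+(?:at|on)\s+(?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)


def extract_phosphorylation(text: str) -> list:
    """Return one dict per match with kinase, substrate, and site (None if absent)."""
    return [m.groupdict() for m in PHOSPHO_PATTERN.finditer(text)]


# Illustrative input sentences.
events = extract_phosphorylation(
    "PKA phosphorylates CREB at Ser133. Akt phosphorylates GSK3."
)
```

A single pattern like this misses passives, nominalizations ("phosphorylation of X by Y"), and multi-word protein names, which is why a real system needs a much richer rule set and a name resource such as BioThesaurus.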

24 24 Literature Corpus for Text Mining Literature survey and manual tagging for evidence attribution Training and benchmarking sets for information retrieval and extraction Protein phosphorylation data used to develop RLIMS-P for extracting phosphorylation information The five PTM datasets used to develop a machine learning algorithm for text categorization

25 25 Online RLIMS-P: 1. Summary table: PMIDs and top-ranking annotation. 2. Report: full annotation with evidence tagging and PMID mapping to UniProtKB entry. 3. Name mapping searches BioThesaurus.

26 26 BioThesaurus Comprehensive collection of protein/gene names from 23 databases Associate names (~3.2 million) with UniProtKB entries (>2 million) Web-based searches to retrieve synonymous names, resolve ambiguous names, evaluate name coverage FTP download for automatic dictionary-based named entity tagging
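The two lookups described here, retrieving synonyms and flagging ambiguous names, can be sketched with a pair of dictionaries. This is a minimal illustration, not the BioThesaurus implementation: the function names are hypothetical, and the accessions `ACC_1` to `ACC_3` are placeholders rather than real UniProtKB identifiers.

```python
from collections import defaultdict


def build_index(pairs):
    """Build name->accessions and accession->names maps from (name, accession) pairs."""
    name_to_acc = defaultdict(set)
    acc_to_names = defaultdict(set)
    for name, acc in pairs:
        clean = name.strip()
        name_to_acc[clean.lower()].add(acc)  # case-insensitive lookup key
        acc_to_names[acc].add(clean)
    return name_to_acc, acc_to_names


def synonyms(name, name_to_acc, acc_to_names):
    """All other names attached to any entry that carries this name."""
    key = name.strip().lower()
    found = set()
    for acc in name_to_acc.get(key, ()):
        found |= acc_to_names[acc]
    return {n for n in found if n.lower() != key}


def is_ambiguous(name, name_to_acc):
    """A name is ambiguous when it maps to more than one entry."""
    return len(name_to_acc.get(name.strip().lower(), ())) > 1


# Placeholder data: accessions are made up for the example.
PAIRS = [
    ("TIMP-3", "ACC_1"),
    ("Metalloproteinase inhibitor 3", "ACC_1"),
    ("CLIM1", "ACC_2"),
    ("CLIM1", "ACC_3"),
]
NAME_TO_ACC, ACC_TO_NAMES = build_index(PAIRS)
```

The same name-to-accession map is what a dictionary-based named entity tagger would load from the downloadable name list.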

27 27 Online BioThesaurus: 1. Search protein entries sharing the same names. 2. Retrieve BioThesaurus report. Name ambiguity of CLIM1. Annotation error detection.

28 28 Gene/Protein Name Mapping (BioThesaurus Report): 1. Search synonyms (synonyms for Metalloproteinase inhibitor 3). 2. Resolve name ambiguity (name ambiguity of TIMP-3). 3. Underlying ID mapping.

29 29 Protein Ontology (PRO). PRotein Ontology (PRO) in the OBO (Open Biomedical Ontologies) Framework. Two sub-ontologies: Ontology for Protein Evolution (ProEvo), for the classification of proteins on the basis of evolutionary relationships; Ontology for Protein Modified Forms (ProMod), to represent the multiple protein forms of a gene (genetic variation, alternative splicing, proteolytic cleavage, and post-translational modification). Why PRO? It allows the specification of relationships between PRO and other ontologies, such as GO and Disease Ontology, and facilitates precise annotation of specific proteins/classes. The PRO prototype is illustrated using human proteins from the TGF-beta signaling pathway (http://pir.georgetown.edu/pro).

30 30 PRO Conceptual Framework

31 31 Protein Ontology (PRO)

32 32 Acknowledgements. PIR Team. Protein Science Team: Darren Natale, Winona Barker, Peter McGarvey, Zhangzhi Hu, Lai-Su Yeh, Anastasia Nikolskaya, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Cecilia Arighi, Xin Yuan. Informatics Team: Hongzhan Huang, Baris Suzek, Leslie Arminski, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Robel Kahsay, Jess Cannata. Students: Natalia Petrova, Paul Ramos, Ti-Cheng Chang, Anna Bank. Collaborators. UniProt: Rolf Apweiler, Amos Bairoch and EBI/SIB Teams. NIAID: Margaret Moore (SSS), Bruno Sobral (VBI). Text Mining: Hongfang Liu (GUMC), Inderjeet Mani (MITRE), Vijay Shanker (U Delaware), Zoran Obradovic (Temple U). Funding Support: NHGRI/NIGMS (UniProt); NCI caBIG; NIAID (Proteomic Admin Center); NSF (iProClass, text mining).

