Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR)

Hu ZZ 1, Mani I 2, Liu H 3, Vijay-Shanker K 4, Hermoso V 1, Nikolskaya A 1, Natale DA 1, and Wu CH 1

1 Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057
2 Georgetown University, 37th and O Streets, NW, Washington, DC 20057
3 University of Maryland at Baltimore County, Baltimore, MD 21250
4 Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716

PIRSF-Based Protein Ontology
PIRSF in DAG View
PIRSF family hierarchy based on evolutionary relationships
Standardized PIRSF family names as hierarchical protein ontology
DAG network structure for PIRSF family classification system

ABSTRACT

An integrated protein literature mining resource, iProLINK, is developed at PIR to provide data sources for Natural Language Processing (NLP) research on bibliography mapping, annotation extraction, protein named-entity recognition, and protein ontology development. A rule-based text-mining system, RLIMS-P, is used to extract protein phosphorylation information from MEDLINE abstracts to assist database annotation; an online BioThesaurus is developed for protein/gene name mapping and to assist with protein named-entity recognition; and a protein ontology based on the PIRSF family classification is developed to complement other ontologies.

As the volume of scientific literature rapidly grows, literature data mining becomes increasingly critical to facilitate genome/proteome annotation and to improve the quality of biological databases. Annotations derived from experimentally verified data in the literature are of special value to the UniProtKB (UniProt Knowledgebase). One objective of UniProtKB is to have accurate, consistent, and rich annotation of protein sequence and function. Relevant to this goal are literature-based curation and the development and adoption of ontologies and controlled vocabularies.
Literature-Based Curation – Extract Reliable Information from Literature
Protein properties: protein function, domains and sites, developmental stages, catalytic activity, binding and modified residues, regulation, induction, pathways, tissue specificity, subcellular location, quaternary structure…
This will ensure high quality, accurate and up-to-date experimental data for each protein. But it is a major bottleneck!

Ontologies/Controlled Vocabularies – For Information Integration and Knowledge Management
UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.

The Protein Information Resource has been collaborating with several NLP research groups to develop text-mining methodologies to extract information from biological literature and to develop protein ontology.

INTRODUCTION

PIR – Integrated Protein Informatics Resource for Genomic/Proteomic Research (http://pir.georgetown.edu)
UniProt – Central international database of protein sequence and function (http://www.uniprot.org)

Bioinformatics. 2005 Jun 1;21(11):2759-65

High recall for paper retrieval and high precision for information extraction
UniProtKB site feature annotation
Proteomics MS data analysis: protein identification
Benchmarking of RLIMS-P

[Figure: text-mining pipeline stages]
Input: Abstracts, Full-Length Texts
Preprocessing: sentence extraction, part of speech tagging
Entity Recognition: acronym detection, term recognition
Phrase Detection: noun and verb group detection, other syntactic structure detection
Semantic Type Classification
Relation Identification: nominal level relation, verbal level relation
Post-Processing
Output: Extracted Annotations, Tagged Abstracts

Pattern 1: (in/at )?
ATR/FRP-1 also phosphorylated p53 in Ser 15

http://pir.georgetown.edu/iprolink/

RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation

RLIMS-P Protein Phosphorylation Annotation Extraction
Manual tagging assisted with computational extraction
Training sets of positive and negative samples

BioThesaurus report: UniProtKB entry P35625

Tagging guideline versions 1.0 and 2.0
–Generation of domain expert-tagged corpora
–Inter-coder reliability – upper bound of machine tagging
Dictionary pre-tagging
–F-measure: 0.412 (0.372 Precision, 0.462 Recall)
–Advantages: helps with standardization and extent of tagging, reduces the fatigue problem, and improves inter-coder reliability.
BioThesaurus for pre-tagging

[Figure: BioThesaurus construction]
Raw Thesaurus
iProClass
NCBI: Entrez Gene, RefSeq, GenPept
UniProt: UniProtKB, UniRef90/50, PIR-PSD
Genome: FlyBase, WormBase, MGD, SGD, RGD
Other: HUGO, EC, OMIM
Name Filtering: Highly Ambiguous Nonsensical Terms
Semantic Typing: UMLS
Name Extraction
UniProtKB Entries: Protein/Gene Names & Synonyms
BioThesaurus

Applications:
Biological entity tagging
Name mapping
Database annotation
Literature mining
Gateway to other resources

BioThesaurus v1.0 (May, 2005; m = million)
# UniProtKB entry: 1.86m
# Source DB record: 6.6m
# Gene/protein name/terms: 3.6m

Protein Name Tagging
Example 2. Name ambiguity of CLIM1

PIRSF to GO Mapping
Superimpose GO and PIRSF hierarchies
Bidirectional display (GO- or PIRSF-centric views)
Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
–68% of the PIRSF families and subfamilies map to GO leaf nodes
–2329 PIRSFs have shared GO leaf nodes

DynGO viewer
Two cases: analyze GO branches and concepts and identify missing GO nodes
Case I. Nuclear receptor superfamily
Case II. IGF-binding protein superfamily

iProLINK: An integrated protein resource for literature mining
1. Bibliography mapping - UniProt mapped citations
2. Annotation extraction - annotation tagged literature
3. Protein entity recognition - dictionary, tagged literature
4. Protein ontology development - PIRSF-based ontology
http://pir.georgetown.edu/iprolink/

Testing and Benchmarking Dataset
RLIMS-P text mining tool
Protein dictionaries
Name tagging guideline
Protein ontology

Protein Ontology Can Complement GO
Expanding a Node: Identification of GO subtrees that need expansion if GO concepts are too broad
–IGFBP subfamilies
–High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

Exploration of Gene and Protein Ontology
GO-centric view
PIRSF-centric view
Molecular function
Biological process
Estrogen receptor alpha (PIRSF50001)
Systematic links between three GO sub-ontologies based on the shared annotations at different protein family levels, e.g., linking molecular function and biological process:
–estrogen receptor binding and
–estrogen receptor signaling pathway

Acknowledgements
Research Projects
NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
NSF: SEIII (Entity Tagging)
NSF: ITR (Ontology)
Collaborators
I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.
H. Liu from University of Maryland Department of Information System on protein name recognition and text mining.
Vijay K. Shanker from University of Delaware Department of Computer and Information Science on text mining of protein phosphorylation features.

Summary
The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
RLIMS-P is a text-mining tool for extracting protein phosphorylation information from PubMed literature.
Coupling high recall for paper retrieval with high precision for information extraction, RLIMS-P can be applied to UniProtKB protein feature annotation.
BioThesaurus can be used to resolve name synonymy and ambiguity and to support name mapping.
The PIRSF-based protein ontology can complement GO by identifying missing GO concepts/nodes and by providing systematic links between the three GO sub-ontologies.

PIRSF Protein Family Classification
PIRSF: A network structure from superfamilies to subfamilies to reflect evolutionary relationships of full-length proteins
Definitions
Basic unit = Homeomorphic Family
Homeomorphic: Full-length similarity, common domain architecture
Network Structure: Flexible number of levels with varying degrees of sequence conservation

Example 1. Name ambiguity of TIMP3

Web-based BioThesaurus
http://pir.georgetown.edu/iprolink/biothesaurus/
Gene/Protein Name Mapping
1. Search Synonyms
2. Resolve Name Ambiguity
3. Underlying ID Mapping

Online RLIMS-P text-mining tool (version 1.0)
http://pir.georgetown.edu/iprolink/rlimsp/
1. Search interface
2. Summary table with top hit of all sites
3. All sites and tagged text evidence

DAG file: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/dagfiles/

Liu et al, 2005, submitted
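
Note on the dictionary pre-tagging score: the poster reports an F-measure of 0.412 from a precision of 0.372 and a recall of 0.462. The short sketch below shows how those three numbers relate, assuming the poster uses the balanced F-measure (the harmonic mean of precision and recall). The function name f_measure and the beta parameter are illustrative and do not come from the poster.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    With beta == 1 this is the balanced F-measure (F1), which is
    assumed here to be the measure reported on the poster.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was tagged correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Dictionary pre-tagging figures as reported on the poster.
score = f_measure(precision=0.372, recall=0.462)
print(round(score, 3))  # 0.412
```

Under that assumption the computed value agrees with the reported 0.412. Because it is a harmonic mean, the score sits nearer the lower of the two inputs than a simple average would.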