Literature Data Mining and Protein Ontology Development


1 Literature Data Mining and Protein Ontology Development
At the Protein Information Resource (PIR)
Hu ZZ*, Mani I, Liu H, Hermoso V, Vijay-Shanker K, Nikolskaya A, Natale DA, and Wu CH
ISMB 2005, Detroit, Michigan, June 29, 2005
Zhang-Zhi Hu, M.D.
Senior Bioinformatics Scientist, PIR
Georgetown University Medical Center
Washington, DC 20007

2 PIR – Integrated Protein Informatics Resource for Genomic/Proteomic Research
New version of PIR homepage
UniProt – Central international database of protein sequence and function

3 Objective: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function
Literature-Based Curation – Extract Reliable Information from Literature
Function, domains/sites, developmental stages, catalytic activity, binding and modified residues, regulation, pathways, tissue specificity, subcellular location, ...
Ensure high-quality, accurate, and up-to-date experimental data for each protein. A major bottleneck!
Ontologies/Controlled Vocabularies – For Information Integration and Knowledge Management
UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.

4 iProLINK: An integrated protein resource for literature mining and literature-based curation
1. Bibliography mapping - UniProt mapped citations
2. Annotation extraction - annotation tagged literature
3. Protein named entity recognition - dictionary, name tagged literature
4. Protein ontology development - PIRSF-based ontology

5 Testing and Benchmarking Dataset
iProLINK: Testing and Benchmarking Dataset, RLIMS-P text mining tool, Protein dictionaries, Name tagging guideline, Protein ontology

The exponential growth of large-scale molecular sequence data and of the PubMed scientific literature has prompted active research in biological literature mining and information extraction to facilitate genome/proteome annotation and improve the quality of biological databases. Motivated by the promise of text mining methodologies and, at the same time, by the lack of adequate curated data for training and benchmarking, the PIR developed the iProLINK resource for protein literature mining and database curation.

As PIR focuses its effort on the curation of the UniProt protein sequence database, the goal of iProLINK is to provide curated data sources that can be utilized for text mining research in the areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development.

The data sources for bibliography mapping and annotation extraction include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter include several hundred abstracts and full-text articles tagged with experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database. The data sources for entity recognition and ontology development include a protein name dictionary, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, as well as a protein ontology based on PIRSF protein family names (Comput. Biol. Chem., 28: ).

6 Protein Phosphorylation Annotation Extraction
Manual tagging assisted with computational extraction
Training sets of positive and negative samples
Evidence attribution
RLIMS-P 3 objects

This example shows literature (an abstract or full-length article) with manually tagged text evidence for biochemical features. Such tagged literature can be used for training and benchmarking algorithms for computational protein feature extraction (e.g. phosphorylation and other types of PTMs). RLIMS-P is one such program, designed to extract protein phosphorylation information.

7 RLIMS-P Rule-based LIterature Mining System for Protein Phosphorylation
Input: Abstracts, Full-Length Texts
Preprocessing: sentence extraction, part of speech tagging
Entity Recognition: acronym detection, term recognition
Phrase Detection: noun and verb group detection, other syntactic structure detection
Semantic Type Classification
Relation Identification: nominal level relation, verbal level relation
Post-Processing
Output: Extracted Annotations, Tagged Abstracts

Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
ATR/FRP-1 also phosphorylated p53 in Ser 15

Motivation: A large volume of experimental data on protein phosphorylation is buried in the fast-growing PubMed literature. While of great value, such information is limited in databases due to the laborious process of literature-based curation. Computational literature mining holds promise to facilitate database curation.

Results: A rule-based system, RLIMS-P (Rule-based LIterature Mining System for Protein Phosphorylation), was used to extract protein phosphorylation information from MEDLINE abstracts. An annotation-tagged literature corpus developed at PIR was used to evaluate the system for finding phosphorylation papers and extracting phosphorylation objects (kinases, substrates, and sites) from abstracts. RLIMS-P achieved a precision and recall of 91.4% and 96.4% for paper retrieval, and of 97.9% and 88.0% for extraction of substrates and sites. Coupling the high recall for paper retrieval and high precision for information extraction, RLIMS-P facilitates literature mining and database annotation of protein phosphorylation.

Availability: The program is available on request from the authors. The phosphorylation patterns and data sets used in this study are available at -
Submitted to Bioinformatics
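
To make Pattern 1 concrete, here is a minimal Python sketch that approximates it with a regular expression and applies it to the example sentence from the slide. This is not how RLIMS-P is implemented: the real system works on noun and verb groups produced by the phrase detection stage, whereas this sketch assumes every entity is a single whitespace-free token. The optional adverb ("also"), the residue list, and the names `PATTERN_1` and `extract_phosphorylation` are illustrative assumptions.

```python
import re

# Simplified, hypothetical rendering of Pattern 1:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
# Assumption: AGENT and THEME are single tokens; SITE is a residue name plus a position.
RESIDUE = r"(?:Ser|Thr|Tyr)"
PATTERN_1 = re.compile(
    r"(?P<agent>\S+)\s+"                  # <AGENT>: the kinase
    r"(?:also\s+)?"                       # optional adverb inside the verb group (assumed)
    r"phosphorylate[sd]?\s+"              # active-voice verb group
    r"(?P<theme>\S+)"                     # <THEME>: the substrate
    r"(?:\s+(?:in|at)\s+(?P<site>" + RESIDUE + r"[\s-]?\d+))?"  # optional (in/at <SITE>)
)


def extract_phosphorylation(sentence):
    """Return the kinase, substrate, and site matched by the pattern, or None."""
    match = PATTERN_1.search(sentence)
    if match is None:
        return None
    return {
        "kinase": match.group("agent"),
        "substrate": match.group("theme"),
        "site": match.group("site"),  # None when the optional site part is absent
    }


result = extract_phosphorylation("ATR/FRP-1 also phosphorylated p53 in Ser 15")
```

For the slide's sentence, `result` holds ATR/FRP-1 as the kinase, p53 as the substrate, and Ser 15 as the site. A sentence without a site still matches, with `site` set to `None`, which mirrors the optional `(in/at <SITE>)?` part of the pattern.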

8 Benchmarking of RLIMS-P
High recall for paper retrieval and high precision for information extraction
Bioinformatics Jun 1;21(11):
UniProtKB site feature annotation
Proteomics Mass Spec. data analysis: protein identification
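
The precision and recall reported for RLIMS-P (91.4% and 96.4% for paper retrieval, 97.9% and 88.0% for extraction of substrates and sites) can be combined into a single balanced F-measure. The slides do not report this number, so the values computed below are derived here, assuming the standard harmonic-mean definition F = 2PR / (P + R). The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for RLIMS-P in the slides.
paper_retrieval_f = f_measure(0.914, 0.964)
site_extraction_f = f_measure(0.979, 0.880)

print(f"paper retrieval F-measure: {paper_retrieval_f:.3f}")
print(f"substrate/site extraction F-measure: {site_extraction_f:.3f}")
```

Under this definition the two tasks come out at roughly 0.938 and 0.927, which matches the slide's point: retrieval leans on recall and extraction on precision, yet both remain well balanced overall.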

9 Online RLIMS-P (version 1.0)
1. Search interface
2. Summary table with top hit of all sites
3. All sites and tagged text evidence

10 BioThesaurus http://pir.georgetown.edu/iprolink/biothesaurus/
Source databases (iProClass):
NCBI: Entrez Gene, RefSeq, GenPept
UniProt: UniProtKB, UniRef90/50, PIR-PSD
Genome: FlyBase, WormBase, MGD, SGD, RGD
Other: HUGO, EC, OMIM

Construction:
UniProtKB Entries: Protein/Gene Names & Synonyms
Name Extraction
Raw Thesaurus
Name Filtering: Highly Ambiguous, Nonsensical Terms
Semantic Typing: UMLS
BioThesaurus

Applications:
Biological entity tagging
Name mapping
Database annotation
Literature mining
Gateway to other resources

BioThesaurus v1.0 (May, 2005; m = million)
# UniProtKB entry: 1.86m
# Source DB record: 6.6m
# Gene/protein names/terms: 3.6m

11 BioThesaurus Report
Synonyms for Metalloproteinase inhibitor 3
Gene/Protein Name Mapping
1. Search Synonyms
2. Resolve Name Ambiguity (e.g. TMP3)
3. Underlying ID Mapping
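
The three name-mapping steps can be illustrated with a small inverted index from names to entries. The sketch below is a toy model, not the BioThesaurus implementation: the entry identifiers (`ENTRY_1`, `ENTRY_2`), the second protein, and the normalization rule are all made up for illustration, and the real resource holds millions of names drawn from many source databases.

```python
import re
from collections import defaultdict

# Toy thesaurus: entry ID -> names and synonyms. IDs and the second entry are hypothetical.
thesaurus = {
    "ENTRY_1": ["Metalloproteinase inhibitor 3", "TIMP-3", "TMP3"],
    "ENTRY_2": ["Example protein B", "TMP3"],
}


def normalize(name):
    """Lowercase and drop everything except letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_index(entries):
    """Invert the thesaurus: normalized name -> set of entry IDs."""
    index = defaultdict(set)
    for entry_id, names in entries.items():
        for name in names:
            index[normalize(name)].add(entry_id)
    return index


index = build_index(thesaurus)


def lookup(name):
    """ID mapping: all entries that carry this name."""
    return sorted(index.get(normalize(name), set()))


def synonyms(name):
    """Search synonyms: every name attached to any entry matching this name."""
    return sorted({syn for entry_id in lookup(name) for syn in thesaurus[entry_id]})


def is_ambiguous(name):
    """Name ambiguity: the same name points to more than one entry."""
    return len(lookup(name)) > 1
```

In this toy data, `lookup("TMP3")` returns both entries, so `is_ambiguous("TMP3")` is true, while `lookup("timp 3")` resolves to a single entry thanks to normalization.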

12 Protein Name Tagging
Tagging guideline versions 1.0 and 2.0
Generation of domain expert-tagged corpora
Inter-coder reliability – upper bound of machine tagging
Dictionary pre-tagging (BioThesaurus for pre-tagging)
F-measure: (0.372 Precision, Recall)
Advantages: helps standardize tagging and its extent, reduces the fatigue problem, and improves inter-coder reliability.

13 PIRSF-Based Protein Ontology
PIRSF family hierarchy based on evolutionary relationships
Standardized PIRSF family names as hierarchical protein ontology
DAG: network structure for the PIRSF family classification system
PIRSF in DAG View

14 DynGO viewer Hongfang Liu University of Maryland
PIRSF to GO Mapping
Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
68% of the PIRSF families and subfamilies map to GO leaf nodes
2329 PIRSFs have shared GO leaf nodes
Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
Superimpose GO and PIRSF hierarchies
Bidirectional display (GO- or PIRSF-centric views)
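
The leaf-node statistics above can be computed from a mapping table and the GO graph. The sketch below uses a toy GO fragment and made-up identifiers (`PIRSF_A`, `GO:binding`, and so on); none of them are real accessions. It also assumes one reading of "shared GO leaf nodes", namely families that map to a leaf that at least one other family maps to as well.

```python
from collections import Counter

# Toy GO fragment as parent -> children. All identifiers are hypothetical.
go_children = {
    "GO:root": ["GO:binding", "GO:signaling"],
    "GO:binding": ["GO:receptor_binding"],
    "GO:signaling": [],
    "GO:receptor_binding": [],
}

# Toy PIRSF family -> GO term mapping (hypothetical).
pirsf_to_go = {
    "PIRSF_A": "GO:receptor_binding",
    "PIRSF_B": "GO:receptor_binding",
    "PIRSF_C": "GO:signaling",
    "PIRSF_D": "GO:binding",
}


def is_leaf(term):
    """A GO term is a leaf when it has no children in the graph."""
    return not go_children.get(term)


def leaf_mapping_stats(mapping):
    """Return (fraction of families mapped to leaves, families sharing a leaf)."""
    leaf_hits = {family: term for family, term in mapping.items() if is_leaf(term)}
    per_leaf = Counter(leaf_hits.values())
    shared = sorted(family for family, term in leaf_hits.items() if per_leaf[term] > 1)
    return len(leaf_hits) / len(mapping), shared


leaf_fraction, shared_families = leaf_mapping_stats(pirsf_to_go)
```

In the toy data three of the four families map to leaves, and two of them share the same leaf. Families that share a leaf are the cases where the PIRSF hierarchy is finer grained than GO and could suggest new child terms.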

15 Protein Ontology Can Complement GO
GO-centric view
Expanding a Node: Identification of GO subtrees that can be expanded when GO concepts are too broad
IGFBP subfamilies and high- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

16 Exploration of Gene and Protein Ontology
PIRSF-centric view
Estrogen receptor alpha (PIRSF50001)
Systematic links between three GO sub-ontologies, e.g., linking molecular function and biological process:
Molecular function: Estrogen receptor binding
Biological process: Estrogen receptor signaling pathway

To evaluate how PIRSF can enrich GO concepts, we have mapped 5500 PIRSF homeomorphic families and subfamilies to the GO hierarchy. We have also developed a viewer (DynGO) for superimposing both classification hierarchies with a bidirectional display showing either a GO-centric or a PIRSF-centric view to facilitate the exploration of GO and PIRSF relationships.

17 Summary
The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
The RLIMS-P text-mining tool extracts protein phosphorylation information from PubMed literature.
BioThesaurus can be used for name mapping to solve name synonym and ambiguity issues.
The PIRSF-based protein ontology can complement other biological ontologies such as GO.

18 Acknowledgements
Research Projects
NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
NSF: SEIII (Entity Tagging)
NSF: ITR (Ontology)
Collaborators
I. Mani, Georgetown University Department of Linguistics: protein name recognition and protein name ontology.
H. Liu, University of Maryland Department of Information Systems: protein name recognition and text mining.
K. Vijay-Shanker, University of Delaware Department of Computer and Information Science: text mining of protein phosphorylation features.

