iProLINK: An integrated protein resource for literature mining and literature-based curation


1 iProLINK: An integrated protein resource for literature mining and literature-based curation
1. Bibliography mapping - UniProt mapped citations
2. Annotation extraction - annotation tagged literature
3. Protein named entity recognition - dictionary, name tagged literature
4. Protein ontology development - PIRSF-based ontology

2 Objective: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function
Literature-Based Curation – Extract Reliable Information from Literature
  Function, domains/sites, developmental stages, catalytic activity, binding and modified residues, regulation, pathways, tissue specificity, subcellular location, ...
  Ensure high-quality, accurate, and up-to-date experimental data for each protein. A major bottleneck!
Ontologies/Controlled Vocabularies – For Information Integration and Knowledge Management
  UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.

3 Access to iProLINK homepage

4 iProLINK
http://pir.georgetown.edu/iprolink/
Testing and Benchmarking Dataset
RLIMS-P text mining tool
Protein dictionaries
Name tagging guideline
Protein ontology

5 Protein Phosphorylation Annotation Extraction
Manual tagging assisted with computational extraction
Training sets of positive and negative samples
RLIMS-P
Evidence attribution
3 objects

6 RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation
Input: Abstracts, Full-Length Texts
Preprocessing: Sentence extraction, Part of speech tagging
Entity Recognition: Acronym detection, Term recognition
Phrase Detection: Noun and verb group detection, Other syntactic structure detection
Semantic Type Classification
Relation Identification: Nominal level relation, Verbal level relation
Post-Processing
Output: Extracted Annotations, Tagged Abstracts
Pattern 1: (in/at )?
Example: ATR/FRP-1 also phosphorylated p53 in Ser 15
Download: http://pir.georgetown.edu/iprolink/
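The pattern-matching step can be illustrated with a small sketch built around the slide's example sentence, "ATR/FRP-1 also phosphorylated p53 in Ser 15". This is not the RLIMS-P rule set. It is a toy regular expression with an optional in/at site clause, and the slot names agent, theme, and site, the function name, and the restriction to Ser/Thr/Tyr are assumptions made for illustration.

```python
import re

# Toy approximation of an "X phosphorylated Y (in/at SITE)?" rule.
# Hypothetical: the slot names and the regex are illustrative, not RLIMS-P's rules.
RESIDUE = r"(?:Ser|Thr|Tyr)"
PATTERN = re.compile(
    r"(?P<agent>\S+)\s+(?:also\s+)?phosphorylate[sd]?\s+"
    r"(?P<theme>\S+)"
    r"(?:\s+(?:in|at)\s+(?P<site>" + RESIDUE + r"[\s-]?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return a dict with agent, theme, and site (site may be None), or None if no match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "agent": match.group("agent"),
        "theme": match.group("theme"),
        "site": match.group("site"),
    }


if __name__ == "__main__":
    print(extract_phosphorylation("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
```

A single surface pattern like this only handles active-voice sentences with single-token names. The pipeline on the slide adds term recognition, phrase detection, and semantic typing before relation identification, which is what lets rules operate on phrases rather than raw tokens.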

7 Benchmarking of RLIMS-P
UniProtKB site feature annotation
Proteomics Mass Spec. data analysis: protein identification
High recall for paper retrieval and high precision for information extraction
Bioinformatics. 2005 Jun 1;21(11):2759-65

8 Online RLIMS-P (version 1.0)
http://pir.georgetown.edu/iprolink/rlimsp/
1. Search interface
2. Summary table with top hit of all sites
3. All sites and tagged text evidence

9 BioThesaurus v1.0 (May, 2005)
http://pir.georgetown.edu/iprolink/biothesaurus/
Construction:
  UniProtKB Entries: Protein/Gene Names & Synonyms
  iProClass
  Name Extraction
  Raw Thesaurus
  Name Filtering: Highly Ambiguous, Nonsensical Terms
  Semantic Typing: UMLS
  BioThesaurus
Source databases:
  NCBI: Entrez Gene, RefSeq, GenPept
  UniProt: UniProtKB, UniRef90/50, PIR-PSD
  Genome: FlyBase, WormBase, MGD, SGD, RGD
  Other: HUGO, EC, OMIM
Applications: Biological entity tagging, Name mapping, Database annotation, Literature mining, Gateway to other resources
Statistics (m = million):
  # UniProtKB entry: 1.86m
  # Source DB record: 6.6m
  # Gene/protein names/terms: 3.6m

10 BioThesaurus Report
Gene/Protein Name Mapping
1. Search Synonyms
2. Resolve Name Ambiguity
3. Underlying ID Mapping
Figure labels: Synonyms for Metalloproteinase inhibitor 3; Name ambiguity TMP3; ID Mapping
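The three name-mapping tasks on this slide (synonym search, ambiguity resolution, ID mapping) can be sketched with two lookup tables. The records, entry IDs, and function names below are hypothetical toy data, not BioThesaurus content or its API. The sketch only assumes that each name is associated with one or more database entries.

```python
from collections import defaultdict

# Hypothetical toy records of (entry ID, name); not real BioThesaurus data.
RECORDS = [
    ("ENTRY_A", "protein alpha"),
    ("ENTRY_A", "PA1"),
    ("ENTRY_B", "protein beta"),
    ("ENTRY_B", "PA1"),
]


def build_index(records):
    """Build entry -> names and lowercased name -> entries lookup tables."""
    names_by_entry = defaultdict(set)
    entries_by_name = defaultdict(set)
    for entry_id, name in records:
        names_by_entry[entry_id].add(name)
        entries_by_name[name.lower()].add(entry_id)
    return names_by_entry, entries_by_name


def synonyms(name, names_by_entry, entries_by_name):
    """Return all other names attached to any entry that carries this name."""
    found = set()
    for entry_id in entries_by_name.get(name.lower(), ()):
        found |= names_by_entry[entry_id]
    return {n for n in found if n.lower() != name.lower()}


def is_ambiguous(name, entries_by_name):
    """A name is ambiguous when it maps to more than one entry."""
    return len(entries_by_name.get(name.lower(), ())) > 1


if __name__ == "__main__":
    by_entry, by_name = build_index(RECORDS)
    print(synonyms("protein alpha", by_entry, by_name))
    print(is_ambiguous("PA1", by_name))
```

The name-to-entries table doubles as the underlying ID mapping: the entries returned for an ambiguous name are the candidates a curator or tagger must choose between.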

11 Protein Name Tagging
Tagging guideline versions 1.0 and 2.0
Generation of domain expert-tagged corpora
Inter-coder reliability – upper bound of machine tagging
Dictionary pre-tagging
  F-measure: 0.412 (0.372 Precision, 0.462 Recall)
  Advantages: helps with standardization and extent of tagging, reduces the fatigue problem, and improves inter-coder reliability.
BioThesaurus for pre-tagging
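The reported F-measure is consistent with the balanced (F1) harmonic mean of the stated precision and recall. The slide does not say which variant was used, so treating it as F1 is an assumption. A short check, with a hypothetical helper name:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported on the slide for dictionary pre-tagging.
    print(round(f_measure(0.372, 0.462), 3))  # 0.412
```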

12 PIRSF-Based Protein Ontology
PIRSF family hierarchy based on evolutionary relationships
Standardized PIRSF family names as hierarchical protein ontology
DAG Network structure for PIRSF family classification system
PIRSF in DAG View

13 PIRSF to GO Mapping
Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
  68% of the PIRSF families and subfamilies map to GO leaf nodes
  2329 PIRSFs have shared GO leaf nodes
Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
DynGO viewer (Hongfang Liu, University of Maryland)
  Superimpose GO and PIRSF hierarchies
  Bidirectional display (GO- or PIRSF-centric views)
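The leaf-node statistics on this slide can be computed from two tables: the ontology's child-to-parents relation and a family-to-term mapping. The sketch below uses a hypothetical four-term DAG and three made-up families. It reads "shared GO leaf nodes" as families whose leaf term is also the target of at least one other family, which is an interpretation, not something the slide states.

```python
# Hypothetical toy ontology: term -> set of parent terms (a DAG, so a term may
# have several parents). Not real GO or PIRSF data.
PARENTS = {
    "term_root": set(),
    "term_a": {"term_root"},
    "term_b": {"term_root"},
    "term_c": {"term_a", "term_b"},
}
# Hypothetical family -> term mapping.
FAMILY_TO_TERM = {"FAM1": "term_c", "FAM2": "term_c", "FAM3": "term_a"}


def leaf_terms(parents):
    """Terms that never appear as a parent of another term."""
    non_leaves = set().union(*parents.values())
    return set(parents) - non_leaves


def leaf_mapping_stats(family_to_term, parents):
    """Return (fraction of families mapped to a leaf, families sharing a leaf)."""
    leaves = leaf_terms(parents)
    families_by_leaf = {}
    for family, term in family_to_term.items():
        if term in leaves:
            families_by_leaf.setdefault(term, []).append(family)
    on_leaf = sum(len(fams) for fams in families_by_leaf.values())
    sharing = sum(len(fams) for fams in families_by_leaf.values() if len(fams) > 1)
    return on_leaf / len(family_to_term), sharing


if __name__ == "__main__":
    print(leaf_terms(PARENTS))
    print(leaf_mapping_stats(FAMILY_TO_TERM, PARENTS))
```

Leaves that attract several families are the places where a finer-grained family hierarchy could add structure beneath an existing term.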

14 Protein Ontology Can Complement GO
Expanding a Node: Identification of GO subtrees that can be expanded when GO concepts are too broad
IGFBP subfamilies and high- vs. low-affinity binding for IGF between IGFBP and IGFBPrP
GO-centric view

15 Exploration of Gene and Protein Ontology
PIRSF-centric view
Estrogen receptor alpha (PIRSF50001)
Systematic links between three GO sub-ontologies, e.g., linking molecular function and biological process:
  Molecular function: Estrogen receptor binding
  Biological process: Estrogen receptor signaling pathway

16 Summary
The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
RLIMS-P is a text-mining tool for extracting protein phosphorylation information from PubMed literature.
BioThesaurus can be used for name mapping to resolve name synonym and ambiguity issues.
The PIRSF-based protein ontology can complement other biological ontologies such as GO.

17 Acknowledgements
Research Projects
  NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
  NSF: SEIII (Entity Tagging)
  NSF: ITR (Ontology)
Collaborators
  I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.
  H. Liu from University of Maryland Department of Information System on protein name recognition and text mining.
  Vijay K. Shanker from University of Delaware Department of Computer and Information Science on text mining of protein phosphorylation features.

