iProLINK: An integrated protein resource for literature mining and literature-based curation

Presentation transcript:

1 iProLINK: An integrated protein resource for literature mining and literature-based curation
1. Bibliography mapping: UniProt mapped citations
2. Annotation extraction: annotation tagged literature
3. Protein named entity recognition: dictionary, name tagged literature
4. Protein ontology development: PIRSF-based ontology

2 Objective: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function
Literature-Based Curation: Extract Reliable Information from Literature
Function, domains/sites, developmental stages, catalytic activity, binding and modified residues, regulation, pathways, tissue specificity, subcellular location, ...
Ensure high-quality, accurate and up-to-date experimental data for each protein. A major bottleneck!
Ontologies/Controlled Vocabularies: For Information Integration and Knowledge Management
UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.

3 Access to iProLINK homepage

4 iProLINK
Testing and benchmarking dataset
RLIMS-P text mining tool
Protein dictionaries
Name tagging guideline
Protein ontology

5 Protein Phosphorylation Annotation Extraction
Manual tagging assisted with computational extraction (RLIMS-P)
Training sets of positive and negative samples
Evidence attribution for 3 objects

6 RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation
Input: Abstracts, Full-Length Texts
Preprocessing: sentence extraction, part of speech tagging
Entity Recognition: acronym detection, term recognition
Phrase Detection: noun and verb group detection, other syntactic structure detection
Semantic Type Classification
Relation Identification: nominal level relation, verbal level relation
Post-Processing
Output: Extracted Annotations, Tagged Abstracts
Pattern 1: (in/at )?
Example: ATR/FRP-1 also phosphorylated p53 in Ser 15
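The example sentence on this slide can be used to illustrate what a rule of this kind extracts. The following is only a minimal sketch, not the RLIMS-P implementation: the actual system applies its patterns to shallow-parsed phrases, whereas this sketch uses a single regular expression over raw text. The group names `agent`, `theme` and `site`, and the restriction to Ser/Thr/Tyr residues, are assumptions made for illustration.

```python
import re

# Hypothetical, simplified stand-in for a phosphorylation pattern:
#   AGENT <phosphorylate verb> THEME (in/at SITE)?
# The group names and the residue list are assumptions, not taken from RLIMS-P.
PATTERN = re.compile(
    r"(?P<agent>\S+)\s+(?:also\s+)?phosphorylate[sd]?\s+"
    r"(?P<theme>\S+)"
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return agent, theme and optional site, or None if nothing matches."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "agent": match.group("agent"),
        "theme": match.group("theme"),
        "site": match.group("site"),  # None when no "in/at SITE" phrase follows
    }


if __name__ == "__main__":
    print(extract_phosphorylation("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
```

A raw regular expression like this breaks on multi-word protein names and passive constructions, which is one reason a phrase-level, rule-based pipeline is used instead.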

7 Benchmarking of RLIMS-P
UniProtKB site feature annotation
Proteomics mass spec. data analysis: protein identification
High recall for paper retrieval and high precision for information extraction
Bioinformatics Jun 1;21(11):
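For readers unfamiliar with the benchmarking terms, recall and precision (and the F-measure that combines them) can be computed from a set of system outputs and a gold-standard set. The sketch below uses the standard definitions; the function name and the toy identifiers in the usage example are illustrative and are not benchmark data from the slides.

```python
def precision_recall_f1(predicted, gold):
    """Standard set-based precision, recall and balanced F-measure."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy identifiers only: 4 retrieved items, 3 relevant items, 2 in common.
    print(precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"}))
```

High recall matters for paper retrieval because a missed paper cannot be curated later, while high precision matters for extraction because every false annotation costs curator time.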

8 Online RLIMS-P (version 1.0)
Search interface
Summary table with top hit of all sites
All sites and tagged text evidence

9 BioThesaurus v1.0 (May, 2005)
Sources of the raw thesaurus (via iProClass):
NCBI: Entrez Gene, RefSeq, GenPept
UniProt: UniProtKB, UniRef90/50, PIR-PSD
Genome: FlyBase, WormBase, MGD, SGD, RGD
Other: HUGO, EC, OMIM
Processing: name extraction (UniProtKB entries: protein/gene names and synonyms); name filtering (highly ambiguous and nonsensical terms); semantic typing (UMLS)
Applications: biological entity tagging; name mapping; database annotation; literature mining; gateway to other resources
Statistics (m = million):
# UniProtKB entries: 1.86m
# Source DB records: 6.6m
# Gene/protein names/terms: 3.6m

10 BioThesaurus Report
Gene/Protein Name Mapping: search synonyms; resolve name ambiguity; underlying ID mapping
Report sections: synonyms for Metalloproteinase inhibitor 3; ID mapping; name ambiguity (TMP3)
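The two lookups described on this slide, finding synonyms and detecting ambiguous names, can be sketched with a pair of indexes built from (entry, name) records. This is an illustrative sketch, not the BioThesaurus implementation: the entry identifiers `ENTRY_1` and `ENTRY_2` and the name "Placeholder protein B" are made up, and the records are toy data rather than real database content.

```python
from collections import defaultdict

# Toy (entry_id, name) records. Entry IDs and "Placeholder protein B" are
# invented for illustration; they are not real accessions or names.
RECORDS = [
    ("ENTRY_1", "Metalloproteinase inhibitor 3"),
    ("ENTRY_1", "TIMP-3"),
    ("ENTRY_1", "TMP3"),
    ("ENTRY_2", "TMP3"),
    ("ENTRY_2", "Placeholder protein B"),
]


def normalize(name):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def build_indexes(records):
    """Build name -> entries and entry -> names lookup tables."""
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for entry_id, name in records:
        name_to_entries[normalize(name)].add(entry_id)
        entry_to_names[entry_id].add(name)
    return name_to_entries, entry_to_names


def lookup(name, name_to_entries, entry_to_names):
    """Report the entries a name maps to, its synonyms, and whether it is ambiguous."""
    key = normalize(name)
    entries = sorted(name_to_entries.get(key, ()))
    return {
        "entries": entries,
        "ambiguous": len(entries) > 1,  # one name shared by several entries
        "synonyms": {
            e: sorted(n for n in entry_to_names[e] if normalize(n) != key)
            for e in entries
        },
    }


if __name__ == "__main__":
    n2e, e2n = build_indexes(RECORDS)
    print(lookup("tmp3", n2e, e2n))
```

In this toy data the short name maps to two entries and is flagged as ambiguous, while the full name maps to one entry and returns its synonyms.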

11 Protein Name Tagging
Tagging guideline versions 1.0 and 2.0
Generation of domain expert-tagged corpora
Inter-coder reliability: upper bound of machine tagging
Dictionary pre-tagging, using BioThesaurus
F-measure: (0.372 Precision, Recall)
Advantages: helps with standardization and extent of tagging, reduces the fatigue problem, and improves inter-coder reliability.
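Dictionary pre-tagging can be pictured as marking every span of text that matches a dictionary name before a human annotator reviews it. The sketch below uses greedy longest-match lookup over whitespace tokens; that matching strategy, the `max_len` limit and the two-entry dictionary are assumptions for illustration and are not taken from the slides.

```python
def pretag(tokens, dictionary, max_len=5):
    """Greedy longest-match tagging.

    `dictionary` holds lower-cased names; returns (start, end, text) spans
    with `end` exclusive. The longest dictionary name starting at each
    position wins, and matched tokens are not reused.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(tokens), i + max_len), i, -1):
            if " ".join(tokens[i:j]).lower() in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end, " ".join(tokens[i:match_end])))
            i = match_end
    return spans


if __name__ == "__main__":
    names = {"metalloproteinase inhibitor 3", "p53"}  # toy dictionary
    text = "Metalloproteinase inhibitor 3 and p53 were detected"
    print(pretag(text.split(), names))
```

Pre-tagged spans give annotators a consistent starting point for where a name begins and ends, which is where the gains in standardization and inter-coder reliability come from.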

12 PIRSF-Based Protein Ontology
PIRSF family hierarchy based on evolutionary relationships
Standardized PIRSF family names as hierarchical protein ontology
DAG: network structure for the PIRSF family classification system
PIRSF in DAG view
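A DAG differs from a simple tree in that a node may have more than one parent. The sketch below stores a hierarchy as a child-to-parents map and collects all ancestors of a node; the node names are placeholders and the two-parent arrangement is a made-up example, not an actual PIRSF family.

```python
# Placeholder hierarchy: each node lists its direct parents.
# FAMILY_A has two parents, which a tree could not represent.
PARENTS = {
    "SUBFAMILY_A1": ["FAMILY_A"],
    "FAMILY_A": ["SUPERFAMILY_X", "SUPERFAMILY_Y"],
    "SUPERFAMILY_X": [],
    "SUPERFAMILY_Y": [],
}


def ancestors(node, parents):
    """Return every node reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # a node can be reached by several paths
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("SUBFAMILY_A1", PARENTS)))
```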

13 PIRSF to GO Mapping
Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
68% of the PIRSF families and subfamilies map to GO leaf nodes
2329 PIRSFs have shared GO leaf nodes
Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
DynGO viewer (Hongfang Liu, University of Maryland): superimposes GO and PIRSF hierarchies; bidirectional display (GO- or PIRSF-centric views)
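The two statistics on this slide, the share of families that map to GO leaf nodes and the families that share a leaf node, can be computed from a family-to-term mapping and the GO child relation. The sketch below shows one way to do that on toy data; the function name and all identifiers are placeholders, and it does not reproduce the counts reported on the slide.

```python
from collections import defaultdict


def leaf_mapping_stats(family_to_terms, term_children):
    """Return (fraction of families mapped to a leaf term, families sharing a leaf).

    A term is treated as a leaf when it has no children in `term_children`.
    """
    families_by_leaf = defaultdict(set)
    for family, terms in family_to_terms.items():
        for term in terms:
            if not term_children.get(term):
                families_by_leaf[term].add(family)
    with_leaf = set().union(*families_by_leaf.values()) if families_by_leaf else set()
    shared = {
        family
        for members in families_by_leaf.values()
        if len(members) > 1
        for family in members
    }
    fraction = len(with_leaf) / len(family_to_terms) if family_to_terms else 0.0
    return fraction, shared


if __name__ == "__main__":
    # Placeholder ontology and mapping, for illustration only.
    children = {
        "TERM_ROOT": ["TERM_MID"],
        "TERM_MID": ["TERM_LEAF_1", "TERM_LEAF_2"],
        "TERM_LEAF_1": [],
        "TERM_LEAF_2": [],
    }
    mapping = {
        "FAM_A": ["TERM_LEAF_1"],
        "FAM_B": ["TERM_LEAF_1"],
        "FAM_C": ["TERM_MID"],
        "FAM_D": ["TERM_LEAF_2"],
    }
    print(leaf_mapping_stats(mapping, children))
```

Families that share a leaf term are the cases where the family classification is finer-grained than GO, which is where it can add detail.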

14 Protein Ontology Can Complement GO (GO-centric view)
Expanding a node: identification of GO subtrees that can be expanded when GO concepts are too broad
Example: IGFBP subfamilies, and high- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

15 Exploration of Gene and Protein Ontology (PIRSF-centric view)
Example family: Estrogen receptor alpha (PIRSF50001)
Systematic links between three GO sub-ontologies, e.g., linking molecular function and biological process:
Molecular function: Estrogen receptor binding
Biological process: Estrogen receptor signaling pathway
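The linking idea can be sketched as follows: when one family carries terms from two GO sub-ontologies, the family itself serves as the bridge between those terms. The function name and data layout below are assumptions for illustration; the family and term labels are taken from the slide, with the pairing of each term to its sub-ontology following the order in which the slide lists them.

```python
from itertools import product


def cross_aspect_links(annotations, aspect_a, aspect_b):
    """Return (term_a, family, term_b) triples for terms co-annotated on a family."""
    links = set()
    for family, terms in annotations.items():
        a_terms = [term for aspect, term in terms if aspect == aspect_a]
        b_terms = [term for aspect, term in terms if aspect == aspect_b]
        for a, b in product(a_terms, b_terms):
            links.add((a, family, b))
    return links


if __name__ == "__main__":
    # Family and terms as named on the slide; the (aspect, term) layout is assumed.
    annotations = {
        "PIRSF50001": [
            ("molecular function", "Estrogen receptor binding"),
            ("biological process", "Estrogen receptor signaling pathway"),
        ],
    }
    for link in cross_aspect_links(annotations, "molecular function", "biological process"):
        print(link)
```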

16 Summary
The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
RLIMS-P is a text-mining tool for extracting protein phosphorylation information from PubMed literature.
BioThesaurus can be used for name mapping to solve name synonym and ambiguity issues.
The PIRSF-based protein ontology can complement other biological ontologies such as GO.

17 Acknowledgements
Research Projects
NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
NSF: SEIII (Entity Tagging)
NSF: ITR (Ontology)
Collaborators
I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.
H. Liu from University of Maryland Department of Information System on protein name recognition and text mining.
K. Vijay-Shanker from University of Delaware Department of Computer and Information Science on text mining of protein phosphorylation features.