Using text mining techniques to support the expansion of controlled vocabularies Irena Spasić

I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D

Project
title: Population of a taxonomy referring to NMR techniques based on evidence automatically extracted from the scientific literature
timeline: 06-Nov-2006 to 17-Nov-2006
partners:
– Irena Spasić (1,2), MCISB
– Daniel Schober (2), EBI
– Dietrich Rebholz-Schuhmann (1), EBI
– Susanna-Assunta Sansone (2), EBI
(1) text mining, (2) MSI Ontology WG
funding: Semantic Mining Network of Excellence, EU Information Society Technologies 6th Framework Programme

Metabolomics Society
founded in 2004
the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomic experiments
5 WGs founded to cover the key areas for describing metabolomic experiments:
– biological sample context
– chemical analysis
– data analysis
– ontology
– data exchange

MSI OWG
Metabolomics Standardisation Initiative Ontology WG
goal: consistent semantic annotation of metabolomics experiments, to enable the community to consistently interpret and integrate their data across disparate electronic resources (software tools and databases)
the OWG will tackle the semantics issue by:
– reaching a consensus on a core set of controlled vocabularies (CVs)
– developing a corresponding ontology
work coordinated by Dr Susanna-Assunta Sansone

Controlled vocabulary
a list of terms, which are used to tag units of information so that they may be more easily retrieved by a search
improves technical communication by ensuring that everyone is using the same term to mean the same thing
the terms are chosen and organized by trained professionals who possess expertise in the subject area
[figure labels: wood, forest, plant]

Ontology
an explicit conceptualisation of a domain through a set of concepts, their definitions and relations between them (Uschold, 1996)
the purpose of an ontology is to provide effective means of communication within a domain, which can be between humans and/or computer systems
Altman et al. (1999) emphasise the communication aspect: ontologies are scientific models that support clear communication between users and, on the other hand, store information in a structured form, thus providing support for automated processing

MSI OWG
to facilitate the development, the OWG has divided the CV coverage into two main components:
– the general experimental component (e.g. design, sample characteristics, treatments)
– the technology-dependent subcomponents (e.g. nuclear magnetic resonance, mass spectrometry, chromatography)
strategy:
1. build a seed CV for each subcomponent manually
2. expand each CV semi-automatically using text mining
3. integrate the CV terms into the overall ontology

Current focus
nuclear magnetic resonance (NMR) spectroscopy
NMR = a technique which exploits the magnetic properties of nuclei in order to identify atom environments, and in some cases the number of atoms of each type, within a sample
important in metabolomics because of the ability to observe mixtures of small molecules in cells and their extracts
three main topics in the NMR ontology: method, instrument & protocol (covering the experimental parameters)

Current status of the NMR CV/ontology
the seed CV was collected by the members of the MSI OWG
an ontology for NMR is under development
the initial ontology was compiled by Dr Daniel Schober
the ontology currently contains around 250 hand-picked NMR-related terms
a total of around 1K terms is expected to be needed in order to complete the ontology

Current status of the NMR CV/ontology
the ontology is available in the OWL format
listed under OBO (Open Biomedical Ontologies)
OBO = an umbrella web address for well-structured controlled vocabularies for shared use across different biological and medical domains

Current work on the NMR CV/ontology
using text mining to expand the coverage of the CV
extracting currently unidentified NMR-related terms from the relevant literature
text mining work done by Dr Irena Spasić in collaboration with Dr Dietrich Rebholz-Schuhmann

1st step: information retrieval
in order to ensure the completeness of the NMR ontology, we propose a text mining approach over a relevant corpus of documents, which can be:
– abstracts
– full papers, where available (especially the Materials and Methods sections)
relevant resources:
– MEDLINE (abstracts)
– PubMed Central (full papers)

Information retrieval (IR)
in order to retrieve the relevant documents, a few approaches may be used and preferably combined:
– identifying a relevant set of MeSH terms
– using the terms currently described in the ontology as search terms
– collecting an initial corpus from domain experts

IR using MeSH terms
MeSH = Medical Subject Headings
MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed
MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

IR using MeSH terms
finding the relevant MeSH terms using the MeSH browser
look up: NMR
resulting MeSH term(s): Magnetic Resonance Spectroscopy
PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
returns:
– 119,589 abstracts from MEDLINE
– 5,905 full papers from PMC
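The MeSH query above can also be issued programmatically. The following is a minimal sketch, assuming the NCBI E-utilities ESearch endpoint is used; the helper name `build_count_query` is hypothetical, and the sketch only builds the request URLs rather than sending them. Counts returned today will differ from the 2006 figures on the slide.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_query(term: str, db: str) -> str:
    """Return an ESearch URL that asks only for the number of hits.

    Hypothetical helper for illustration; `rettype=count` tells ESearch
    to report the hit count instead of a list of IDs.
    """
    params = {"db": db, "term": term, "rettype": "count"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


mesh_query = "Magnetic Resonance Spectroscopy[MeSH Terms]"
medline_url = build_count_query(mesh_query, db="pubmed")  # abstracts
pmc_url = build_count_query(mesh_query, db="pmc")         # full papers

print(medline_url)
print(pmc_url)
```

Fetching either URL (for example with `urllib.request.urlopen`) returns a small XML document whose `Count` element holds the number of matching records.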

IR of full papers
NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract is expected to report only the results discovered and not the experimental conditions leading to these results
the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process the full-text articles as opposed to abstracts only
as a consequence, an IR approach based on MeSH terms or search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)
[figure labels: NMR, MEDLINE (abstracts), PubMed Central (full papers), biomedical literature]

IR of full papers
objective: to increase the recall and obtain papers that describe research that utilises the NMR technology, but which do not deal with NMR per se and therefore are not indexed by NMR-related MeSH terms
approach: use NMR ontology terms as search terms over full-text articles

IR of full papers: strategy
1. for each term, obtain the number of papers returned from PMC
2. sort the terms by the number of papers they return; set a cut-off point to remove the terms that return too many papers, as they are likely to be broad terms not limited to NMR and therefore would introduce a lot of noise
3. for each remaining term, retrieve the PMC IDs of the papers it returns
4. sort the PMC IDs according to the number of times they are retrieved; set a cut-off point to remove the ones that do not contain a sufficient number of known NMR terms
5. retrieve the full papers from PMC
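The selection logic in steps 1 to 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the function name `select_documents`, both cut-off parameters and the toy search results are hypothetical, and the PMC searches themselves are replaced by an in-memory mapping from each term to the set of PMC IDs it retrieves.

```python
from collections import Counter


def select_documents(term_hits, max_papers_per_term, min_matching_terms):
    """Pick PMC IDs that are retrieved by enough specific ontology terms.

    term_hits maps each search term to the set of PMC IDs it returns.
    """
    # Steps 1-2: drop broad terms that return too many papers.
    specific = {
        term: ids
        for term, ids in term_hits.items()
        if len(ids) <= max_papers_per_term
    }
    # Steps 3-4: count how many of the remaining terms retrieve each paper.
    counts = Counter(pmc_id for ids in specific.values() for pmc_id in ids)
    # Keep the papers whose number of matching terms exceeds the threshold.
    return sorted(p for p, n in counts.items() if n > min_matching_terms)


# Toy search results (made-up IDs, for illustration only).
term_hits = {
    "spectrum": {"PMC1", "PMC2", "PMC3", "PMC4", "PMC5"},  # too broad
    "pulse sequence": {"PMC1", "PMC2"},
    "chemical shift": {"PMC1", "PMC2", "PMC3"},
    "magic angle spinning": {"PMC1"},
}

selected = select_documents(term_hits, max_papers_per_term=4, min_matching_terms=1)
print(selected)
```

Step 5 would then download the full text of each selected ID into the local corpus.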

IR: selecting the search terms
[figure label: 2400]

IR: selecting the documents
[figure labels: PMC ID, number of matching terms > threshold, PMC, local corpus, = 3]

Automatic term recognition (ATR)
C-value: a domain-independent ATR method, which combines linguistic knowledge and statistical analysis
linguistic part:
– used as a filter to select term candidates
– includes part-of-speech tagging, syntactic pattern matching and a stop list
statistical part:
– used to estimate the termhood of candidate terms
– includes frequency of occurrence, frequency of nested occurrence, length

C-value: linguistic part
part-of-speech (POS) tagging is the process of tagging the words in a text as corresponding to a particular part of speech (e.g. noun, verb, adjective), based on their definition and a particular context (i.e. in relation to adjacent and related words)
example: beta/ADJ isoforms/N of/PREP glucocorticoid/N receptor/N
the POS information is used later during syntactic pattern matching, which extracts only those word sequences that conform to certain syntactic rules
the patterns used in the C-value method describe the typical inner structure of terms, e.g. (ADJ | N) + | ((ADJ | N)* [N PREP] (ADJ | N)*) N
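Syntactic pattern matching over POS tags can be sketched with an ordinary regular expression applied to the tag sequence. The sketch below is an assumption-laden illustration: it reads the pattern as "an optional adjective/noun sequence, optionally containing one noun-preposition pair, ending in a noun", encodes ADJ, N and PREP as the single letters A, N and P, and enumerates every span of at least two words so that nested candidates are kept as well. The function names are hypothetical.

```python
import re

# One-letter codes for the tags used in the pattern; anything else becomes "x".
TAG_CODES = {"ADJ": "A", "N": "N", "PREP": "P"}

# ((ADJ | N)+ | (ADJ | N)* [N PREP] (ADJ | N)*) N, written over the codes.
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]


def candidate_terms(tagged, min_len=2):
    """Return every word span whose tag sequence matches TERM_PATTERN."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    found = []
    for start in range(len(tagged)):
        for end in range(start + min_len, len(tagged) + 1):
            if TERM_PATTERN.fullmatch(codes[start:end]):
                found.append(" ".join(word for word, _ in tagged[start:end]))
    return found


tagged = parse_tagged("beta/ADJ isoforms/N of/PREP glucocorticoid/N receptor/N")
candidates = candidate_terms(tagged)
for term in candidates:
    print(term)
```

Spans that start or end with the preposition, such as "of glucocorticoid", are rejected because the pattern only allows a preposition directly after a noun and requires a final noun.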

C-value: statistical analysis
termhood of each candidate term t is calculated using:
– |t|: its length as the number of words
– f(t): its frequency of occurrence
– S(t): the set of other candidate terms containing it as a subphrase
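The three quantities can be combined as in the commonly cited form of the C-value measure: log2|t| · f(t) when S(t) is empty, and log2|t| · (f(t) − the mean of f(s) over s in S(t)) otherwise. The sketch below assumes that form (the base of the logarithm and other details vary between implementations, and the variant used in this project is not stated here); the function names and the toy frequencies are hypothetical.

```python
import math


def contains_subphrase(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    lw, sw = longer.split(), shorter.split()
    if len(lw) <= len(sw):
        return False
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))


def c_values(freq):
    """Score each candidate term from a {term: frequency} mapping."""
    scores = {}
    for term, f_t in freq.items():
        length = len(term.split())                                # |t|
        nesting = [s for s in freq if contains_subphrase(s, term)]  # S(t)
        if nesting:
            # Discount occurrences that are explained by longer terms.
            adjusted = f_t - sum(freq[s] for s in nesting) / len(nesting)
        else:
            adjusted = f_t
        scores[term] = math.log2(length) * adjusted
    return scores


# Made-up frequencies, for illustration only.
freq = {"chemical shift": 10, "proton chemical shift": 4, "pulse sequence": 6}
scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.2f}")
```

With this form a single-word candidate always scores zero, because log2(1) = 0; that is one reason the method is usually applied to multi-word candidates.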

Using the C-value method

Problem
C-value extracts statistically significant terms
C-value does not differentiate between the “NMR terms” and the terms representing the study subjects, in which NMR is used only as an analytical technique but is not itself the focus of the study
we need to reduce the number of terms not directly related to the NMR technique

C-value results

A solution
the initial inspection of automatically extracted terms revealed the main types of concepts studied using NMR: substances, organisms, organs, conditions/diseases…
a straightforward approach to filtering out such terms is to use the existing dictionaries of these terms and match them against the list of automatically extracted terms

Unified Medical Language System (UMLS)
UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies
UMLS contains the following semantic classes relevant to our problem:
– Organism A.1.1
– Anatomical Structure A.1.2
– Substance A.1.4
– Biological Function B
– Injury or Poisoning B.2.3
we used these classes to automatically extract the corresponding terms from the UMLS thesaurus

Provisional NMR terms
the terms extracted by the C-value method which contain any of the chosen UMLS terms are removed from the final list of “NMR terms”
in this manner, 88 terms have been extracted from abstracts and passed on to the curators
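The dictionary-based filter can be sketched as a word-level containment test. This is a minimal illustration, assuming case-insensitive matching on whole words; the function names, the candidate list and the stand-in dictionary entries are hypothetical, and in practice the dictionary would hold the terms exported from the chosen UMLS semantic classes.

```python
def contains_term(candidate, dictionary_term):
    """True if dictionary_term occurs as a whole-word sequence in candidate."""
    cw = candidate.lower().split()
    dw = dictionary_term.lower().split()
    if not dw:
        return False
    return any(cw[i:i + len(dw)] == dw for i in range(len(cw) - len(dw) + 1))


def filter_nmr_terms(candidates, umls_terms):
    """Drop every candidate that contains any of the dictionary terms."""
    return [
        c for c in candidates
        if not any(contains_term(c, u) for u in umls_terms)
    ]


# Made-up candidates and a tiny stand-in for the UMLS-derived dictionary.
candidates = [
    "pulse sequence",
    "rat liver extract",
    "chemical shift",
    "glucose concentration",
]
umls_terms = {"liver", "glucose", "rat"}

provisional = filter_nmr_terms(candidates, umls_terms)
print(provisional)
```

Matching on whole words rather than raw substrings avoids discarding a candidate such as "separation" merely because it contains the letters of "rat".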

Summary
[figure labels: Entrez Utilities Web Service, UMLS]

The End

Motivation
[figure labels: NMR, MEDLINE (abstracts), PubMed Central (full papers), biomedical literature]