Using text mining techniques to support the expansion of controlled vocabularies Irena Spasić


1 Using text mining techniques to support the expansion of controlled vocabularies Irena Spasić i.spasic@manchester.ac.uk http://www.cbr-masterclass.org/

2 Project title: Population of a taxonomy referring to NMR techniques based on evidence automatically extracted from the scientific literature timeline: 06-Nov-2006 to 17-Nov-2006 partners: Irena Spasić (1,2), MCISB; Daniel Schober (2), EBI; Dietrich Rebholz-Schuhmann (1), EBI; Susanna-Assunta Sansone (2), EBI (1 = text mining, 2 = MSI Ontology WG) funding: Semantic Mining Network of Excellence, EU Information Society Technologies 6th Framework Programme

3 Metabolomics Society http://www.metabolomicssociety.org founded in 2004 the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomic experiments 5 WGs founded to cover the key areas for describing metabolomic experiments: –biological sample context –chemical analysis –data analysis –ontology –data exchange

4 MSI OWG Metabolomics Standardisation Initiative Ontology WG http://msi-ontology.sourceforge.net msi-workgroups-ontology@lists.sourceforge.net goal: consistent semantic annotation of metabolomics experiments to enable the community to consistently interpret and integrate their data across disparate electronic resources (software tools and databases) the OWG will tackle the semantics issue by: reaching a consensus on a core set of controlled vocabularies (CVs) and developing a corresponding ontology work coordinated by Dr Susanna-Assunta Sansone

5 Controlled vocabulary a list of terms, which are used to tag units of information so that they may be more easily retrieved by a search improves technical communication by ensuring that everyone is using the same term to mean the same thing the terms are chosen and organized by trained professionals who possess expertise in the subject area wood forest plant

6 Ontology an explicit conceptualisation of a domain through a set of concepts, their definitions and relations between them (Uschold, 1996) the purpose of an ontology is to provide effective means of communication within a domain, which can be between humans and/or computer systems Altman et al. (1999) emphasise the communication aspect: ontologies are scientific models that support clear communication between users, and, on the other hand, store information in a structured form, thus providing support for automated processing

7 MSI OWG to facilitate the development, the OWG has divided the CV coverage into two main components: –the general experimental component (e.g. design, sample characteristics, treatments) –the technology-dependent subcomponents (e.g. nuclear magnetic resonance, mass spectrometry, chromatography) strategy: 1. build a seed CV for each subcomponent manually 2. expand each CV semi-automatically using text mining 3. integrate the CV terms into the overall ontology

8 Current focus nuclear magnetic resonance (NMR) spectroscopy NMR = a technique which exploits the magnetic properties of nuclei in order to identify atom environments, and in some cases the number of atoms of each type, within a sample important in metabolomics because of the ability to observe mixtures of small molecules in cells and their extracts three main topics in the NMR ontology: method, instrument & protocol (covering the experimental parameters)

9 Current status of the NMR CV/ontology the seed CV collected by the members of the MSI OWG an ontology for NMR is under development the initial ontology compiled by Dr Daniel Schober the ontology currently contains around 250 hand-picked NMR-related terms it is expected to collect a total of around 1K terms in order to complete the ontology

10 Current status of the NMR CV/ontology the ontology is available in the OWL format http://www.w3.org/TR/owl-features/ listed under OBO (Open Biomedical Ontologies) OBO = an umbrella web address for well-structured controlled vocabularies for shared use across different biological and medical domains http://obo.sourceforge.net/

11 Current work on the NMR CV/ontology using text mining to expand the coverage of the CV extracting currently unidentified NMR-related terms from the relevant literature text mining work done by Dr Irena Spasić in collaboration with Dr Dietrich Rebholz-Schuhmann

12 1st step: information retrieval in order to ensure the completeness of the NMR ontology, we propose a text mining approach over a relevant corpus of documents, which can be: –abstracts –full papers (especially the Materials and Methods sections), where available relevant resources: –MEDLINE (abstracts) http://www.nlm.nih.gov/pubs/factsheets/medline.html –PubMed Central (full papers) http://www.pubmedcentral.gov/

13 Information retrieval (IR) in order to retrieve the relevant documents, a few approaches may be used and preferably combined: –identifying a relevant set of MeSH terms –using the terms currently described in the ontology as search terms –collecting an initial corpus from domain experts

14 IR using MeSH terms MeSH = Medical Subject Headings http://www.nlm.nih.gov/mesh/ MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

15 IR using MeSH terms finding the relevant MeSH terms using the MeSH browser http://www.nlm.nih.gov/mesh/MBrowser.html look up: NMR resulting MeSH term(s): Magnetic Resonance Spectroscopy PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms] returns: –119,589 abstracts from MEDLINE –5,905 full papers from PMC
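The PubMed query on this slide can also be issued programmatically. The following is a minimal sketch, assuming the NCBI E-utilities ESearch endpoint is used (the summary slide mentions an Entrez Utilities web service, but the exact calls used in the project are not given in the transcript). The function names esearch_url and parse_count are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public ESearch endpoint of the NCBI E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, db="pubmed", retmax=0):
    """Build an ESearch URL for `term` in database `db` ("pubmed" or "pmc").

    retmax=0 requests only the hit count, not the list of IDs.
    """
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

def parse_count(xml_text):
    """Read the total number of hits from an ESearch XML response."""
    return int(ET.fromstring(xml_text).findtext("Count"))

query = "Magnetic Resonance Spectroscopy [MeSH Terms]"
medline_url = esearch_url(query, db="pubmed")  # abstracts
pmc_url = esearch_url(query, db="pmc")         # full papers
```

Fetching either URL (for example with urllib.request.urlopen) and passing the response body to parse_count would give the current hit count, which will differ from the 2006 figures quoted on the slide.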

16 IR of full papers NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study; therefore, an abstract is expected to report only the results discovered, not the experimental conditions leading to them the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material; therefore, it is important to process the full-text articles as opposed to abstracts only as a consequence, an IR approach based on MeSH terms or search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked) [figure labels: NMR; MEDLINE (abstracts); PubMed Central (full papers); biomedical literature]

17 IR of full papers objective: to increase the recall and obtain papers that describe research that utilises the NMR technology, but which do not deal with NMR per se and therefore are not indexed by NMR-related MeSH terms approach: use NMR ontology terms as search terms over full-text articles

18 IR of full papers: strategy 1. for each term, obtain the number of papers returned from PMC 2. sort the terms by the number of papers they return; set a cut-off point to remove the terms that return too many papers, as they are likely to be broad terms not limited to NMR and would therefore introduce a lot of noise 3. for each remaining term, retrieve the PMC IDs of the papers it returns 4. sort the PMC IDs according to the number of times they are retrieved; set a cut-off point to remove the ones that do not contain a sufficient number of known NMR terms 5. retrieve the full papers from PMC
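Steps 2 to 4 of this strategy can be sketched in a few lines. This is an illustration only: the function names, the toy hit counts and the two cut-off parameters (max_hits, min_terms) are assumptions, and the per-term hit counts and PMC ID lists are taken as already retrieved from PMC.

```python
from collections import Counter

def select_terms(hit_counts, max_hits):
    """Step 2: drop terms that return too many papers; sort the rest by hit count."""
    kept = [term for term, hits in hit_counts.items() if hits <= max_hits]
    return sorted(kept, key=hit_counts.get)

def select_documents(term_to_ids, terms, min_terms):
    """Steps 3-4: count how many selected terms retrieve each PMC ID,
    then keep the IDs matched by at least `min_terms` terms."""
    matches = Counter()
    for term in terms:
        # set() ensures a paper is counted once per term
        matches.update(set(term_to_ids.get(term, ())))
    return [pmcid for pmcid, n in matches.most_common() if n >= min_terms]

# Toy data, not taken from the project
hit_counts = {"pulse sequence": 120, "spectrum": 50000,
              "chemical shift": 900, "magic angle spinning": 40}
term_to_ids = {"pulse sequence": ["PMC1", "PMC2"],
               "chemical shift": ["PMC1", "PMC3"],
               "magic angle spinning": ["PMC1", "PMC2"]}

terms = select_terms(hit_counts, max_hits=1000)
corpus_ids = select_documents(term_to_ids, terms, min_terms=2)
```

Here the broad term "spectrum" is discarded in step 2, and only papers matched by at least two of the remaining terms go forward to step 5.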

19 IR: selecting the search terms 2400

20 IR: selecting the documents PMC ID number of matching terms > threshold PMC local corpus = 3

21 Automatic term recognition (ATR) C-value – domain independent ATR method, which combines linguistic knowledge and statistical analysis linguistic part: –used as a filter to select term candidates –includes part-of-speech tagging, syntactic pattern matching and a stop list statistical part: –used to estimate the termhood of candidate terms –includes frequency of occurrence, frequency of nested occurrence, length

22 C-value: linguistic part part-of-speech (POS) tagging is the process of tagging the words in a text as corresponding to a particular part of speech (e.g. noun, verb, adjective) based on their definitions and a particular context (i.e. in relation to adjacent and related words) beta/ADJ isoforms/N of/PREP glucocorticoid/N receptor/N the POS information is used later during syntactic pattern matching, which is used to extract only those word sequences that conform to certain syntactic rules the patterns used in the C-value method describe the typical inner structure of terms, e.g. ((ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*)) N
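The pattern-matching step can be illustrated as follows. This is a minimal sketch, not the C-value implementation itself: it assumes the text is already tagged in the word/TAG form shown on the slide, encodes each tag as a single character, and applies a regular expression equivalent to the slide's pattern (reading [N PREP] as optional). It returns only the longest match at each position, whereas a full implementation would also consider nested sub-sequences as candidates.

```python
import re

# One character per tag; any other tag becomes "X" and breaks a candidate.
TAG_CODES = {"ADJ": "A", "N": "N", "PREP": "P"}

# ((ADJ | N)+ | (ADJ | N)* N PREP (ADJ | N)*) N
# The longer alternative is tried first so that it is preferred.
PATTERN = re.compile(r"(?:[AN]*NP[AN]*|[AN]+)N")

def candidate_terms(tagged_text):
    """Extract word sequences whose tag sequence matches PATTERN.

    Assumes every whitespace-separated token has the form word/TAG.
    """
    pairs = [token.rsplit("/", 1) for token in tagged_text.split()]
    words = [word for word, _ in pairs]
    codes = "".join(TAG_CODES.get(tag, "X") for _, tag in pairs)
    return [" ".join(words[m.start():m.end()]) for m in PATTERN.finditer(codes)]

example = "beta/ADJ isoforms/N of/PREP glucocorticoid/N receptor/N"
print(candidate_terms(example))
```

Because each token maps to exactly one character, match offsets in the tag string index directly into the word list.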

23 C-value: statistical analysis termhood of each candidate term t is calculated using: –|t|: its length as the number of words –f(t): its frequency of occurrence –S(t): the set of other candidate terms containing it as a subphrase
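The formula itself does not survive in the transcript. The sketch below assumes the commonly published formulation of C-value, written in the slide's notation: C-value(t) = log2|t| · f(t) when S(t) is empty, and log2|t| · (f(t) − (1/|S(t)|) · Σ f(s) over s in S(t)) otherwise. The function names and the toy frequencies are hypothetical.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs inside the strictly longer `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_values(freq):
    """Compute C-value for each candidate term.

    freq maps a candidate term (words separated by spaces) to f(t).
    Assumed formula: log2|t| * (f(t) - mean of f(s) over s in S(t)),
    where the mean is taken as 0 if S(t) is empty.
    """
    words = {term: tuple(term.split()) for term in freq}
    scores = {}
    for term, w in words.items():
        # S(t): other candidates that contain t as a subphrase
        supers = [s for s, sw in words.items() if contains(sw, w)]
        adjusted = freq[term]
        if supers:
            adjusted -= sum(freq[s] for s in supers) / len(supers)
        scores[term] = math.log2(len(w)) * adjusted
    return scores

# Toy frequencies, not taken from the project
scores = c_values({"magnetic resonance": 10,
                   "nuclear magnetic resonance": 6,
                   "magnetic resonance spectroscopy": 2})
```

Under this formulation single-word candidates score zero, since log2(1) = 0, which matches the method's focus on multi-word terms.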

24 Using the C-value method http://www.nactem.ac.uk/batch.php

25 Problem C-value extracts statistically significant terms C-value does not differentiate between the “NMR terms” and the terms representing the study subjects in which NMR is used only as an analytical technique, but itself is not the focus of a study we need to reduce the number of terms not directly related to the NMR technique

26 C-value results

27 A solution the initial inspection of automatically extracted terms revealed the main types of concepts studied using NMR: substances, organisms, organs, conditions/diseases… a straightforward approach to filtering out such terms is to use the existing dictionaries of these terms and match them against the list of automatically extracted terms

28 Unified Medical Language System (UMLS) UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies http://umlsks.nlm.nih.gov UMLS contains the following semantic classes relevant to our problem: Organism A.1.1 Anatomical Structure A.1.2 Substance A.1.4 Biological Function B.2.2.1 Injury or Poisoning B.2.3 we used these classes to automatically extract the corresponding terms from the UMLS thesaurus

29 Provisional NMR terms the terms extracted by the C-value method which contain any of the chosen UMLS terms are removed from the final list of “NMR terms” in this manner, 88 terms have been extracted from abstracts and passed on to the curators
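The filtering rule on this slide can be sketched as below. Matching on whole words and ignoring case are assumptions, since the transcript does not say how containment was tested, and the dictionary entries are illustrative stand-ins for terms drawn from the chosen UMLS semantic classes.

```python
def contains_phrase(words, phrase):
    """True if the word list `phrase` occurs as a contiguous run in `words`."""
    m = len(phrase)
    return any(words[i:i + m] == phrase for i in range(len(words) - m + 1))

def filter_candidates(candidates, dictionary_terms):
    """Keep only the candidates that contain none of the dictionary terms."""
    # Skip blank dictionary entries, which would otherwise match everything.
    phrases = [t.lower().split() for t in dictionary_terms if t.strip()]
    kept = []
    for candidate in candidates:
        words = candidate.lower().split()
        if not any(contains_phrase(words, p) for p in phrases):
            kept.append(candidate)
    return kept

# Toy data, not taken from the project
candidates = ["pulse sequence", "rat liver extract",
              "chemical shift", "glucose concentration"]
dictionary_terms = ["liver", "glucose", "rat"]
provisional = filter_candidates(candidates, dictionary_terms)
```

Word-level matching avoids discarding a candidate merely because a dictionary entry appears inside a longer word (for example "rat" inside "concentration").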

30 Summary Entrez Utilities Web Service UMLS

31 The End

32 Motivation [figure labels: NMR; MEDLINE (abstracts); PubMed Central (full papers); biomedical literature]

